<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3803">
  <Title>Graph-Based Text Representation for Novelty Detection</Title>
  <Section position="2" start_page="0" end_page="17" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Novelty detection is the task of identifying novel information given a set of already accumulated background information. Potential applications of novelty detection systems are abundant, given the &quot;information overload&quot; in email, web content, etc. Gabrilovich et al. (2004), for example, describe a scenario in which a newsfeed is personalized based on a measure of information novelty: the user is presented with pieces of information that are novel given the documents that have already been reviewed. This spares the user the task of sifting through vast amounts of duplicate and redundant information on a topic to find the pieces that are of interest.</Paragraph>
    <Paragraph position="1"> In 2002 TREC introduced a novelty track (Harman 2002), which continued, with major changes, in 2003 (Soboroff and Harman 2003) and 2004 (Voorhees 2004). In 2002 the task was to identify the set of relevant and novel sentences from an ordered set of documents within a TREC topic. Novelty was defined as &quot;providing new information that has not been found in any previously picked sentences&quot;. Relevance was defined as &quot;relevant to the question or request made in the description section of the topic&quot;. Inter-annotator agreement was low (Harman 2002).</Paragraph>
    <Paragraph position="2"> There were 50 topics for the novelty task in 2002.</Paragraph>
    <Paragraph position="3"> For the 2003 novelty track a number of major changes were made. Relevance and novelty detection were separated into different tasks, allowing a separate evaluation of each. In the 2002 track, the data had proved problematic because the percentage of relevant sentences in the documents was small. This, in turn, led to a very high percentage of relevant sentences being novel, since there was little redundancy within the small set of relevant sentences. Fifty new topics were created for the 2003 task, with a better balance of relevant and novel sentences. Slightly more than half of the topics dealt with &quot;events,&quot; the rest with &quot;opinions.&quot; The 2004 track used the same tasks, the same number of topics and the same split between event and opinion topics as the 2003 track.</Paragraph>
    <Paragraph position="4"> For the purpose of this paper, we are only concerned with novelty detection, specifically with task 2 of the 2004 novelty track, as described in more detail in the following section.</Paragraph>
    <Paragraph position="5"> The question that we investigate here is: what is a meaningful feature set for text representation for novelty detection? This is obviously a far-reaching and loaded question. Possibilities range from simple bag-of-words features to features derived from sophisticated linguistic representations.</Paragraph>
    <Paragraph position="6"> Ultimately, the question is open-ended since there will always be another feature or feature combination that could or should be exploited. For our experiments, we have decided to focus more narrowly on the usefulness of features derived from graph representations, and we have restricted ourselves to representations that do not require linguistic analysis. Simple bag-of-words metrics like KL divergence establish a baseline for classifier performance. More sophisticated metrics can be defined on the basis of graph representations. Graph representations of text can be constructed without performing linguistic analysis, by using term distances in sentences and pointwise mutual information between terms to form edges between term-vertices. A term-distance-based representation has been used successfully for a variety of tasks in Mihalcea (2004) and Mihalcea and Tarau (2004).</Paragraph>
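    <Paragraph position="7"> As an illustration of the graph construction mentioned above, the following is a minimal Python sketch, not the implementation used in this paper. It links two terms when they occur within a fixed token distance of each other in a sentence and weights each edge by pointwise mutual information. The function name build_term_graph, the parameter max_distance, the whitespace-style pre-tokenized input, and the simple count-based probability estimates are all assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations


def build_term_graph(sentences, max_distance=2):
    """Build an undirected term graph from pre-tokenized sentences.

    Hypothetical sketch: two distinct terms are linked when they occur
    within max_distance tokens of each other in the same sentence, and
    the edge weight is their pointwise mutual information (PMI).
    """
    term_counts = Counter()   # how often each term occurs
    pair_counts = Counter()   # how often each term pair co-occurs in range

    for tokens in sentences:
        term_counts.update(tokens)
        for i, j in combinations(range(len(tokens)), 2):
            # Skip pairs that are too far apart or are the same term.
            if j - i > max_distance or tokens[i] == tokens[j]:
                continue
            pair_counts[frozenset((tokens[i], tokens[j]))] += 1

    total_terms = sum(term_counts.values())
    total_pairs = sum(pair_counts.values())

    graph = {}
    for pair, count in pair_counts.items():
        a, b = sorted(pair)
        # Assumed estimates: joint probability from pair counts,
        # marginals from unigram counts.
        p_ab = count / total_pairs
        p_a = term_counts[a] / total_terms
        p_b = term_counts[b] / total_terms
        graph[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return graph
```

One possible use, again only as an illustration and not a feature taken from this paper, is to build such a graph from the background sentences and then measure the share of a new sentence's edges that are absent from it.</Paragraph>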
  </Section>
</Paper>