<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3803"> <Title>Graph-Based Text Representation for Novelty Detection</Title> <Section position="3" start_page="17" end_page="17" type="metho"> <SectionTitle> 2 Previous work </SectionTitle> <Paragraph position="0"> There were 13 participants and 54 submitted runs for the 2004 TREC novelty track task 2. Each participant submitted up to five runs with different system configurations. Metrics and approaches varied widely, from purely string-based approaches to systems that used sophisticated linguistic components for synonymy resolution, coreference resolution and named entity recognition. Many systems employed a thresholding approach to the task, defining a novelty metric and then determining a sentence to be novel if the threshold is exceeded (e.g. Blott et al. 2004, Zhang et al. 2004, Abdul-Jaleel et al. 2004, Eichmann et al. 2004, Erkan 2004). Thresholds are either determined on the 2003 data, are based on a notion of mean score, or are determined in an ad hoc manner; unfortunately, some of the system descriptions are unclear about the exact rationale for choosing a particular threshold. Tomiyama et al. (2004), similar to our approach, use an SVM classifier to make the binary classification of a sentence as novel or not.</Paragraph> <Paragraph position="1"> The baseline result for the 2004 task 2 was an average F-measure of 0.577. This baseline is achieved if all relevant sentences are categorized as novel. The difficulty of the novelty detection task is evident from the relatively low scores achieved by even the best systems. The five best-performing runs were: 1. Blott et al. (2004) (Dublin City University): using a tf.idf based metric of &quot;importance value&quot; at an ad hoc threshold: 0.622.</Paragraph> <Paragraph position="2"> 2. Tomiyama et al. (2004) (Meiji University): using an SVM classifier trained on 2003 data, with features based on conceptual fuzzy sets derived from a background corpus: 0.619.</Paragraph> <Paragraph position="3"> 3. Abdul-Jaleel et al. (2004) (UMass): using named entity recognition, cosine similarity as a metric, and thresholds derived from the 2003 data set: 0.618.</Paragraph> <Paragraph position="4"> 4. Schiffman and McKeown (2004) (Columbia): using a combination of tests based on weights (derived from a background corpus) for previously unseen words, with parameters trained on the 2003 data set, and taking into account the novelty status of the previous sentence: 0.617.</Paragraph> <Paragraph position="5"> 5. Tomiyama et al. (2004) (Meiji University): a slight variation of the system described above, with one of the features (the scarcity measure) eliminated: 0.617.</Paragraph> <Paragraph position="6"> As this list shows, there was no clear tendency for any particular kind of approach to outperform the others. 
Among the above four systems and five runs, there are thresholding and classification approaches, systems that use background corpora and conceptual analysis, and systems that do not.</Paragraph> </Section> <Section position="4" start_page="17" end_page="18" type="metho"> <SectionTitle> 3 Experimental setup </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="17" end_page="18" type="sub_section"> <SectionTitle> 3.1 The task </SectionTitle> <Paragraph position="0"> Task 2 of the 2004 novelty track is formulated as follows: Task 2: Given the relevant sentences in the complete document set (for a given topic), identify all novel sentences.</Paragraph> <Paragraph position="1"> The procedure is sequential on an ordered list of sentences per topic. For each sentence S_i the determination needs to be made whether it is novel given the previously seen sentences S_1, ..., S_(i-1).</Paragraph> <Paragraph position="2"> The evaluation metric for the novelty track is the F-measure, averaged over all 50 topics.</Paragraph> </Section> <Section position="2" start_page="18" end_page="18" type="sub_section"> <SectionTitle> 3.2 Novelty detection as classification </SectionTitle> <Paragraph position="0"> For the purpose of this paper we view novelty detection as a supervised classification task. While the supervised approach has its limitations in real-life scenarios where annotated data are hard to come by, it can serve as a testing ground for the question we are interested in: the evaluation of feature sets and text representations.</Paragraph> <Paragraph position="1"> At training time, a feature vector is created for each tagged sentence S and the set of sentences that comprise the already seen information that S is compared to. Features in the vector can be features of the tagged sentence, features of the set of sentences comprising the given background information, and features that capture a relation between the tagged sentence and the set of background sentences. A classifier is trained on the set of resulting feature vectors. At evaluation time, a feature vector is extracted from the sentence to be evaluated and from the set of sentences that form the background knowledge. The classifier then determines whether, given the feature values of that vector, the sentence is more likely to be novel or not.</Paragraph> <Paragraph position="2"> We use the TREC 2003 data set for training, since it is close to the 2004 data set in its makeup. We train Support Vector Machines (SVMs) on the 2003 data, using the LibSVM tool (Chang and Lin 2001). Following the methodology outlined in Chang and Lin 2003, we use radial basis function (RBF) kernels and perform a grid search on two-fold cross-validated results on the training set to identify optimal parameter settings for the penalty parameter C and the RBF parameter γ.</Paragraph> <Paragraph position="3"> Continuously valued features are scaled to values between -1 and 1. The scaling range is determined on the training set and the same range is applied to the test set.</Paragraph> <Paragraph position="4"> The text was minimally preprocessed before extracting features: stop words were removed, tokens were lowercased and punctuation was stripped from the strings.</Paragraph> </Section> </Section>
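The sketch below illustrates the kind of training setup described in section 3.2: an RBF-kernel SVM, a grid search over C and γ with two-fold cross-validation, and features scaled to [-1, 1] on the training data. It is only an illustration under stated assumptions: the original work used the LibSVM command-line tool, whereas the sketch uses scikit-learn's libsvm-backed SVC, and the parameter grid and function names are ours, not the authors'.

# Illustrative sketch (not the authors' code): RBF-kernel SVM with a grid
# search over C and gamma on two-fold cross-validated training data, and
# feature scaling to [-1, 1] determined on the training set only.
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV

def train_novelty_classifier(X_train, y_train):
    # Scale continuous features to [-1, 1]; the same transform is later
    # applied unchanged to the test data.
    scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)
    X_scaled = scaler.transform(X_train)

    # Exponential grid over C and gamma (an assumed grid, in the spirit of
    # the LibSVM practical guide), two-fold cross-validation, F1 scoring.
    param_grid = {"C": [2 ** k for k in range(-5, 16, 2)],
                  "gamma": [2 ** k for k in range(-15, 4, 2)]}
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=2, scoring="f1")
    search.fit(X_scaled, y_train)   # y_train: 1 = novel, 0 = not novel
    return scaler, search.best_estimator_

# At evaluation time: X_test_scaled = scaler.transform(X_test); model.predict(X_test_scaled)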
<Section position="5" start_page="18" end_page="21" type="metho"> <SectionTitle> 4 Text representations and features </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="18" end_page="18" type="sub_section"> <SectionTitle> 4.1 KL divergence as a feature </SectionTitle> <Paragraph position="0"> Treating sentences as an unordered collection of terms, the information-theoretic metric of KL divergence (or relative entropy) has been successfully used to measure the &quot;distance&quot; between documents, by simply comparing the term distribution in one document to that in another document or set of documents. The notions of distance and novelty are closely related: if a new document is very distant from the collection of documents that has been seen previously, it is likely to contain new, previously unseen information. Gabrilovich et al. (2004), for example, report on a successful use of KL divergence for novelty detection. KL divergence is defined in Equation 1.</Paragraph> <Paragraph position="1"> KL(d || R) = Σ_w p_d(w) log ( p_d(w) / p_R(w) ) Equation 1: KL divergence.</Paragraph> <Paragraph position="2"> w belongs to the set of words that are shared between document d and document (set) R. p_d and p_R are the probability distributions of words in d and R, respectively. Both p_d(w) and p_R(w) need to be non-zero in the equation above. We used simple add-one smoothing to ensure non-zero values. While it is conceivable that KL divergence could take into account other features than just bag-of-words information, we restrict ourselves to this particular use of the measure since it corresponds to the typical use in novelty detection.</Paragraph> </Section>
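A minimal sketch of the KL divergence feature of Equation 1 with add-one smoothing follows. The text leaves the exact vocabulary of the sum slightly ambiguous (shared words versus all words in either term distribution); the sketch sums over the union, which the smoothing makes well-defined, and all names are ours.

# Illustrative sketch of the KL divergence feature (Equation 1) with
# add-one smoothing over the combined vocabulary.
import math
from collections import Counter

def kl_divergence(sentence_tokens, background_tokens):
    d = Counter(sentence_tokens)          # term counts of the sentence d
    r = Counter(background_tokens)        # term counts of the seen sentences R
    vocab = set(d) | set(r)
    v = len(vocab)
    # Add-one smoothing keeps both distributions non-zero for every word.
    total_d = sum(d.values()) + v
    total_r = sum(r.values()) + v
    kl = 0.0
    for w in vocab:
        p_d = (d[w] + 1) / total_d
        p_r = (r[w] + 1) / total_r
        kl += p_d * math.log(p_d / p_r)
    return kl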
<Section position="2" start_page="18" end_page="20" type="sub_section"> <SectionTitle> 4.2 Term distance graphs: from text to graph without linguistic analysis </SectionTitle> <Paragraph position="0"> KL divergence as described above treats a document or sentence as an unordered collection of words. Language obviously provides more structure than that. Linguistic resources can impose structure on a string of words through consultation of linguistic knowledge (either hand-coded or learned from a tagged corpus). Even without any outside knowledge, however, the order of words in a sentence provides a means to construct a highly connected undirected graph with the words as vertices. The intuition here is: 1. All words in a sentence have some relationship to all other words in the sentence, modulo a &quot;window size&quot; outside of which the relationship is not taken into consideration. 2. The closer two words are to each other, the stronger their connection tends to be. It follows from (2) that weights on the edges will be inversely proportional to the distance between two words (vertices). In the remainder of the paper we will refer to these graphs as TD (term distance) graphs. Of course (1) and (2) are rough generalizations with many counterexamples, but without the luxury of linguistic analysis this seems to be a reasonable step to advance beyond simple bag-of-words assumptions. Multiple sentence graphs can then be combined into a highly connected graph to represent text. Mihalcea (2004) and Mihalcea and Tarau (2004) have successfully explored very similar graph representations for extractive summarization and keyword extraction. In addition to distance, we also employ the pointwise mutual information (PMI) between two words/vertices, as defined in Equation 2, in the calculation of edge weight. (We also computed results from a graph where the edge weight is determined only by term distance, without PMI; these results were consistently worse than the ones reported here.) This combination of distance and a cooccurrence measure such as PMI is reminiscent of decaying language models, as described for IR, for example, in Gao et al. (2002); we are grateful to an anonymous reviewer for pointing this out. Cooccurrence is counted at the sentence level, i.e. P(i,j) is estimated from the number of sentences that contain both terms w_i and w_j, and P(i) and P(j) are estimated by counting the total number of sentences containing w_i and w_j, respectively. As the set of seen sentences grows and cooccurrence between words becomes more prevalent, PMI becomes more influential on edge weights, strengthening edges between words that have high PMI.</Paragraph> <Paragraph position="1"> PMI(i,j) = log ( P(i,j) / ( P(i) P(j) ) ) Equation 2: pointwise mutual information between two terms i and j.</Paragraph> <Paragraph position="2"> The view that the strength of the connection between two words is inversely related to their distance is supported by examining dependency structures derived from the Penn Treebank and mapping the probability of a dependency to the distance between words. See also Eisner and Smith (2005), who explore this generalization for dependency parsing.</Paragraph> <Paragraph position="3"> Formally, the weight wt for each edge in the graph is defined as in Equation 3, where d_i,j is the distance between words w_i and w_j, and PMI(i,j) is the pointwise mutual information between words w_i and w_j, given the sentences seen so far. For the purpose of Equation 3 we ignored negative PMI values, i.e. we treated negative PMI values as 0.</Paragraph> <Paragraph position="4"> Equation 3: Assigning weight to an edge between two vertices.</Paragraph> <Paragraph position="5"> We imposed a &quot;window size&quot; as a limit on the maximum distance between two words to enter an edge relationship. Window size was varied between 3 and 8; on the training set a window size of 6 proved to be optimal.</Paragraph> <Paragraph position="6"> On a TD graph representation, we can calculate various features based on the strengths and number of connections between words. In novelty detection, we can model the growing store of background information by adding each &quot;incoming&quot; sentence graph to the existing background graph. If an &quot;incoming&quot; edge already exists in the background graph, the weight of the &quot;incoming&quot; edge is added to the existing edge weight.</Paragraph> <Paragraph position="7"> Figure 1 shows a subset of a TD graph for the first two sentences of topic N57. The visualization is generated by the Pajek tool (Batagelj and Mrvar).</Paragraph> </Section>
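The sketch below shows one way to build and update a background TD graph as described in section 4.2: a window size of 6, sentence-level cooccurrence counts for PMI, negative PMI clipped to zero, and edge weights accumulated into the background graph. The exact edge-weight formula of Equation 3 is not recoverable from the extracted text, so wt = max(PMI, 0) / distance is used here purely as a stand-in; all names are ours.

# Illustrative sketch of maintaining the background TD graph.
# wt = max(PMI, 0) / distance is an assumed stand-in for Equation 3.
import math
from collections import defaultdict

WINDOW = 6                             # reported as optimal on the training set
background = defaultdict(float)        # (w1, w2) -> accumulated edge weight
sent_count = 0
df = defaultdict(int)                  # word -> number of sentences containing it
cdf = defaultdict(int)                 # (w1, w2) -> number of sentences containing both

def pmi(a, b):
    # Sentence-level PMI estimated from the counts seen so far.
    if sent_count == 0 or cdf[(a, b)] == 0:
        return 0.0
    p_ab = cdf[(a, b)] / sent_count
    return math.log(p_ab / ((df[a] / sent_count) * (df[b] / sent_count)))

def add_sentence(tokens):
    """Update cooccurrence counts and fold the sentence graph into the background."""
    global sent_count
    sent_count += 1
    for w in set(tokens):
        df[w] += 1
    for a in set(tokens):
        for b in set(tokens):
            if a < b:
                cdf[(a, b)] += 1
    for i, a in enumerate(tokens):
        for j in range(i + 1, min(i + WINDOW + 1, len(tokens))):   # distance <= WINDOW
            b = tokens[j]
            if a == b:
                continue
            edge = tuple(sorted((a, b)))
            wt = max(pmi(*edge), 0.0) / (j - i)   # stand-in for Equation 3
            background[edge] += wt                # reinforce an existing edge, or create it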
<Section position="3" start_page="20" end_page="21" type="sub_section"> <SectionTitle> 4.3 Graph features </SectionTitle> <Paragraph position="0"> In novelty detection, graph-based features allow us to assess the change a graph undergoes through the addition of a new sentence. The intuition behind these features is that the more a graph changes when a sentence is added, the more likely the added sentence is to contain novel information.</Paragraph> <Paragraph position="1"> After all, novel information may be conveyed even if the terms involved are not novel. Establishing a new relation (i.e. an edge in the graph) between two previously seen terms would have exactly that effect: old terms conveying new information. KL divergence or any other measure of distributional similarity is not suited to capturing this scenario. As an example, consider a news story thread about a crime. The various sentences in the background information may mention the victim, multiple suspects, previous convictions, similar crimes etc.</Paragraph> <Paragraph position="2"> When a new sentence is encountered where one suspect's name is mentioned in the same sentence as the victim, at a close distance, neither of these two terms is new. The fact that suspect and victim are mentioned in one sentence, however, may indicate a piece of novel information: a close relationship between the two that did not exist in the background story.</Paragraph> <Paragraph position="3"> We designed 21 graph-based features, based on the following definitions: * Background graph: the graph representing the previously seen sentences.</Paragraph> <Paragraph position="4"> * G(S): the graph of the sentence that is currently being evaluated.</Paragraph> <Paragraph position="5"> * Reinforced background edge: an edge that exists both in the background graph and in G(S).</Paragraph> <Paragraph position="6"> * Added background edge: a new edge in G(S) that connects two vertices that already exist in the background graph.</Paragraph> <Paragraph position="7"> * New edge: an edge in G(S) that connects two previously unseen vertices.</Paragraph> <Paragraph position="8"> * Connecting edge: an edge in G(S) between a previously unseen vertex and a previously seen vertex.</Paragraph> <Paragraph position="9"> The features include: * the ratio between the sum of weights on new edges and the sum of weights on added background edges * the ratio between the sum of weights on new edges and the sum of weights on connecting edges * the ratio between the sum of weights on added background edges and the sum of weights on connecting edges * the ratio between the sum of weights on added background edges and the sum of pre-existing weights on those edges * the ratio between the sum of weights on new edges and the sum of weights on background edges * the ratio between the sum of weights added to reinforced background edges and the sum of background weights * the ratio between the number of added background edges and the number of reinforced background edges * the number of background edges leading from those background vertices that have been connected to new vertices by G(S) We refer to this set of 21 features as simple graph features, to distinguish them from a second set of graph-based features that are based on TextRank.</Paragraph>
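A minimal sketch of the simple graph features follows, illustrating the edge categories defined above and a few of the weight-ratio features; only a subset of the 21 features is shown, the graph representation (dicts from edge pairs to weights) is an assumption, and the names are ours.

# Illustrative sketch: categorize the edges of a sentence graph G(S) against
# the background graph and compute a few of the weight-ratio features.
def simple_graph_features(sentence_graph, background_graph):
    background_vertices = {w for edge in background_graph for w in edge}
    new_e, added_e, connecting_e, reinforced_e = {}, {}, {}, {}
    for edge, wt in sentence_graph.items():
        a, b = edge
        if edge in background_graph:
            reinforced_e[edge] = wt                  # reinforced background edge
        elif a in background_vertices and b in background_vertices:
            added_e[edge] = wt                       # added background edge
        elif a in background_vertices or b in background_vertices:
            connecting_e[edge] = wt                  # connecting edge
        else:
            new_e[edge] = wt                         # new edge

    def total(edges):
        return sum(edges.values())

    def ratio(x, y):
        return x / y if y else 0.0                   # guard against empty categories

    return {
        "new_to_added": ratio(total(new_e), total(added_e)),
        "new_to_connecting": ratio(total(new_e), total(connecting_e)),
        "added_to_connecting": ratio(total(added_e), total(connecting_e)),
        "new_to_background": ratio(total(new_e), total(background_graph)),
        "n_added_to_n_reinforced": ratio(len(added_e), len(reinforced_e)),
    }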
<Paragraph position="10"> The TextRank metric, as described in Mihalcea and Tarau (2004), is inspired by the PageRank metric, which is used for web page ranking. Other graph representations of text have also been proposed: Erkan and Radev (2005) introduced LexRank, where a graph representation of a set of sentences is derived from the cosine similarity between sentences, and Kurland and Lee (2004) derive a graph representation for a set of documents by linking documents X and Y with edges weighted by the score that a language model trained on X assigns to Y.</Paragraph> <Paragraph position="11"> TextRank is designed to work well in text graph representations: it can take edge weights into account and it works on undirected graphs. TextRank calculates a weight for each vertex, based on Equation 4.</Paragraph> <Paragraph position="12"> TR(V_i) = (1 - d) + d * Σ_{V_j ∈ adj(V_i)} ( wt_ji / Σ_{V_k ∈ adj(V_j)} wt_jk ) * TR(V_j) Equation 4: the TextRank score of a vertex V_i. Here adj(V_x) is the set of vertices connected to V_x by a single edge, wt_xy is the weight of the edge between vertex x and vertex y, and d is a constant &quot;dampening factor&quot;, set at the default value used by Mihalcea and Tarau (2004), who in turn base their setting on Brin and Page (1998).</Paragraph> <Paragraph position="13"> To calculate TR, an initial score of 1 is assigned to all vertices, and the formula is applied iteratively until the difference in scores between iterations falls below a threshold of 0.0001 for all vertices (as in Mihalcea and Tarau 2004). The TextRank score itself is not particularly enlightening for novelty detection: it measures the &quot;importance&quot; rather than the novelty of a vertex, hence its usefulness in keyword extraction. We can, however, derive a number of features from the TextRank scores that measure the change in scores as a result of adding a sentence to the graph of the background information. The rationale is that the more the TextRank scores are &quot;disturbed&quot; by the addition of a new sentence, the more likely it is that the new sentence carries novel information. We normalize the TextRank scores by the number of vertices to obtain a probability distribution. The features we define on the basis of the (normalized) TextRank metric are: 1. sum of TR scores on the nodes of S, after adding S 2. maximum TR score on any node of S 3. maximum TR score on any background node before adding S 4. delta between 2 and 3 5. sum of TR scores on the background nodes (after adding S) 6. delta between 5 and 1 7. variance of the TR scores before adding S 8. variance of TR scores after adding S 9. delta between 7 and 8 10. ratio of 1 to 5 11. KL divergence between the TR scores before and after adding S</Paragraph> </Section> </Section> </Paper>
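The following sketch shows a weighted TextRank iteration in the spirit of Equation 4, run to the 0.0001 convergence threshold described above, plus a few of the derived change features. It is an illustration only: d = 0.85 is assumed here as the Brin and Page default the paper points to, normalization by the number of vertices follows the wording in the text, and all names are ours.

# Illustrative sketch of weighted TextRank over an undirected TD graph.
# `edges` is a dict mapping (w1, w2) -> weight; d = 0.85 is an assumed default.
from collections import defaultdict

def textrank(edges, d=0.85, eps=1e-4):
    adj = defaultdict(dict)                        # adjacency with edge weights
    for (a, b), wt in edges.items():
        adj[a][b] = wt
        adj[b][a] = wt
    scores = {v: 1.0 for v in adj}                 # initial score of 1 for all vertices
    while True:
        new_scores = {}
        for v in adj:
            rank = 0.0
            for u, wt in adj[v].items():
                rank += wt / sum(adj[u].values()) * scores[u]
            new_scores[v] = (1 - d) + d * rank
        if all(abs(new_scores[v] - scores[v]) < eps for v in adj):
            return new_scores
        scores = new_scores

# A few of the derived change features (numbers 1-3 in the list above).
def tr_change_features(before, after, sentence_vertices):
    def norm(s):
        # Normalized by the number of vertices, as described in the text.
        return {v: x / len(s) for v, x in s.items()} if s else {}
    b, a = norm(before), norm(after)
    return {
        "sum_tr_sentence_after": sum(a.get(v, 0.0) for v in sentence_vertices),
        "max_tr_sentence_after": max((a.get(v, 0.0) for v in sentence_vertices), default=0.0),
        "max_tr_background_before": max(b.values(), default=0.0),
    }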