<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-2005">
  <Title>Story Link Detection and New Event Detection are Asymmetric</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Common Processing and Models
</SectionTitle>
    <Paragraph position="0"> The Link Detection and New Event Detection systems that we developed for TDT2002 share many processing steps. These include preprocessing to tokenize the data, recognize and normalize abbreviations, remove stop-words, replace spelled-out numbers by digits, add part-of-speech tags, and replace the tokens by their stems, followed by the generation of term-frequency vectors. Document frequency counts are incrementally updated as new sources of stories are presented to the system. Additionally, separate source-specific counts are used, so that, for example, the term frequencies for the New York Times are computed separately from those for CNN. The source-specific, incremental document frequency counts are used to compute a TF-IDF term vector for each story.</Paragraph>
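The incremental, source-specific counting described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the class name `SourceSpecificTfidf`, its method names, and the smoothed IDF formula are all assumptions, since the text does not state the exact weighting. Stories are assumed to arrive already preprocessed into lists of stems.

```python
import math
from collections import Counter, defaultdict


class SourceSpecificTfidf:
    """Incrementally updated, per-source document-frequency counts (illustrative)."""

    def __init__(self):
        # source -> term -> number of stories from that source containing the term
        self.doc_freq = defaultdict(Counter)
        # source -> number of stories seen so far from that source
        self.num_docs = Counter()

    def update(self, source, terms):
        """Fold one preprocessed story into the counts for its source."""
        self.num_docs[source] += 1
        self.doc_freq[source].update(set(terms))

    def vector(self, source, terms):
        """Return a term -> weight dict for one story, using the counts seen so far."""
        tf = Counter(terms)
        n = self.num_docs[source]
        weights = {}
        for term, count in tf.items():
            df = self.doc_freq[source][term]
            # Assumed smoothing: keeps the IDF finite and positive, even for unseen terms.
            weights[term] = count * math.log((n + 1) / (df + 0.5))
        return weights
```

Keeping one counter per source means that a term common on one source but rare on another receives a different weight depending on where the story came from, which is the effect the paragraph describes.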
    <Paragraph position="1"> Stories are compared using either the cosine distance or the Hellinger distance, each computed over the terms $w$ in documents $d_1$ and $d_2$.</Paragraph>
    <Paragraph position="3"> To help compensate for stylistic differences between sources (e.g., newspaper vs. broadcast news), translation errors, and automatic speech recognition errors (Allan et al., 1999), we subtract the average observed similarity values, in a spirit similar to the use of thresholds conditioned on the sources (Carbonell et al., 2001).</Paragraph>
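A minimal Python sketch of the story comparison and the mean subtraction is given here, using the textbook definitions of the cosine measure and of the Hellinger (Bhattacharyya-style) measure named in Section 4. The equations themselves are not preserved in this text, so these forms are assumptions. Keying the running average on the pair of sources is also an assumption, and `SourcePairNormalizer` is a hypothetical name. Term vectors are sparse dicts with non-negative weights.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-weight dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def hellinger_similarity(u, v):
    """Sum over shared terms of sqrt(p_u(t) * p_v(t)), with weights normalized to sum to 1."""
    total_u = sum(u.values())
    total_v = sum(v.values())
    if total_u == 0.0 or total_v == 0.0:
        return 0.0
    return sum(math.sqrt((w / total_u) * (v[t] / total_v))
               for t, w in u.items() if t in v)


class SourcePairNormalizer:
    """Subtracts the running mean similarity observed for a pair of sources."""

    def __init__(self):
        self.sums = {}
        self.counts = {}

    def observe(self, source_a, source_b, sim):
        """Record one observed similarity for this (unordered) source pair."""
        key = tuple(sorted((source_a, source_b)))
        self.sums[key] = self.sums.get(key, 0.0) + sim
        self.counts[key] = self.counts.get(key, 0) + 1

    def adjust(self, source_a, source_b, sim):
        """Return sim minus the pair's mean, or sim unchanged if the pair is unseen."""
        key = tuple(sorted((source_a, source_b)))
        if key not in self.counts:
            return sim
        return sim - self.sums[key] / self.counts[key]
```

Subtracting a per-pair mean shifts scores so that, for instance, a newspaper story compared with an automatically transcribed broadcast is not penalized merely because such pairs tend to score lower overall.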
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 New Event Detection
</SectionTitle>
    <Paragraph position="0"> In order to decide whether a new document $d$ describes a new event, it is compared to all previous documents, and the document $d^*$ with the highest similarity is identified. If the score $\mathrm{score}(d) = 1 - \mathrm{sim}(d, d^*)$ exceeds a threshold $\theta$, then there is no sufficiently similar previous document, and $d$ is classified as a new event.</Paragraph>
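The decision rule can be sketched in Python as below, assuming the score is one minus the highest similarity to any earlier document. The function name `detect_new_event`, its parameters, and the treatment of the very first story (no earlier documents, so the best similarity defaults to 0) are assumptions made for illustration.

```python
def detect_new_event(new_doc, previous_docs, similarity, threshold):
    """Return (is_new_event, score), where score = 1 - max similarity to earlier documents.

    `similarity` is any callable taking two documents and returning a number,
    for example a cosine measure over TF-IDF vectors.
    """
    # With no earlier documents, the best similarity is assumed to be 0.
    best = max((similarity(new_doc, old) for old in previous_docs), default=0.0)
    score = 1.0 - best
    return score > threshold, score
```

Because only the single closest earlier document matters, one highly similar story is enough to suppress a new-event decision, regardless of how dissimilar all other stories are.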
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Link Detection
</SectionTitle>
    <Paragraph position="0"> In order to decide whether a pair of stories $d_1$ and $d_2$ is linked, we compute the similarity between the two documents using the cosine and Hellinger metrics. The two similarity scores are combined using a support vector machine, and the margin is thresholded as a confidence measure.</Paragraph>
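As a rough illustration of thresholding the margin, the sketch below evaluates the decision value of an already trained linear support vector machine over the two similarity features. The linear kernel, the function names, and the `weights` and `bias` parameters are assumptions; the text does not specify the kernel or how the model was trained, and no training step is shown.

```python
def link_confidence(cosine_sim, hellinger_sim, weights, bias):
    """Signed decision value w . x + b of a linear SVM over the two similarity features."""
    return weights[0] * cosine_sim + weights[1] * hellinger_sim + bias


def are_linked(cosine_sim, hellinger_sim, weights, bias, threshold=0.0):
    """Declare a link when the decision value reaches the threshold."""
    return link_confidence(cosine_sim, hellinger_sim, weights, bias) >= threshold
```

Raising `threshold` above zero trades missed links for fewer false alarms, which is the role a thresholded confidence measure plays in the paragraph above.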
  </Section>
</Paper>