<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1051">
  <Title>Unsupervised Construction of Large Paraphrase Corpora: Exploiting Massively Parallel News Sources</Title>
  <Section position="3" start_page="2" end_page="2" type="intro">
    <SectionTitle>
2 Data/Methodology
</SectionTitle>
    <Paragraph position="0"> Our two paraphrase datasets are distilled from a corpus of news articles gathered from thousands of news sources over an extended period. While the idea of exploiting multiple news reports for paraphrase acquisition is not new, previous efforts (for example, Shinyama et al. 2002; Barzilay and Lee 2003) have been restricted to at most two news sources. Our work represents what we believe to be the first attempt to exploit the explosion of news coverage on the Web, where a single event can generate scores or hundreds of different articles within a brief period of time. Some of these articles represent minor rewrites of an original AP or Reuters story, while others represent truly distinct descriptions of the same basic facts. The massive redundancy of information conveyed with widely varying surface strings is a resource begging to be exploited.</Paragraph>
    <Paragraph position="1"> Figure 1 shows the flow of our data collection process. We begin with sets of pre-clustered URLs which point to news articles on the Web, representing thousands of different news sources.</Paragraph>
    <Paragraph position="2"> The clustering algorithm takes into account the full text of each news article, in addition to temporal cues, to produce a set of topically and temporally related articles. Our method is believed to be independent of the specific clustering technology used. The story text is isolated from a sea of advertisements and other miscellaneous text through use of a supervised HMM.</Paragraph>
    <Paragraph position="3"> Altogether we collected 11,162 clusters in an 8month period, assembling 177,095 articles with an average of 15.8 articles per cluster. The clusters are generally coherent in topic and focus. Discrete events like disasters, business announcements, and deaths tend to yield tightly focused clusters, while ongoing stories like the SARS crisis tend to produce less focused clusters. While exact duplicate articles are filtered out of the clusters, many slightly-rewritten variants remain.</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
2.1 Extracting Sentential Paraphrases
</SectionTitle>
      <Paragraph position="0"> Two separate techniques were employed to extract likely pairs of sentential paraphrases from these clusters. The first used string edit distance, counting the number of lexical deletions and insertions needed to transform one string into another. The second relied on a discourse-based heuristic, specific to the news genre, to identify likely paraphrase pairs even when they have little superficial similarity.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>