<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1051">
  <Title>Unsupervised Construction of Large Paraphrase Corpora: Exploiting Massively Parallel News Sources</Title>
  <Section position="4" start_page="2" end_page="5" type="metho">
    <SectionTitle>
3 Levenshtein Distance
</SectionTitle>
    <Paragraph position="0"> A simple edit distance metric (Levenshtein 1966) was used to identify pairs of sentences within a cluster that are similar at the string level. First, each sentence was normalized to lower case and paired with every other sentence in the cluster.</Paragraph>
    <Paragraph position="1"> Pairings that were identical or differing only by punctuation were rejected, as were those where the shorter sentence in the pair was less than two thirds the length of the longer, this latter constraint in effect placing an upper bound on edit distance relative to the length of the sentence. Pairs that had been seen before in either order were also rejected.</Paragraph>
    <Paragraph position="2"> Filtered in this way, our dataset yields 139K non-identical sentence pairs at a Levenshtein distance of n [?] 12.</Paragraph>
    <Paragraph position="3">  Mean Levenshtein distance was 5.17, and mean sentence length was 18.6 words. We will refer to this dataset as L12.</Paragraph>
    <Paragraph position="4">  The second extraction technique was specifically intended to capture paraphrases which might contain very different sets of content words, word order, and so on. Such pairs are typically used to illustrate the phenomenon of paraphrase, but precisely because their surface dissimilarity renders automatic discovery difficult, they have generally not been the focus of previous computational approaches.</Paragraph>
    <Paragraph position="5"> In order to automatically identify sentence pairs of this type, we have attempted to take advantage of some of the unique characteristics of the dataset. The topical clustering is sufficiently precise to ensure that, in general, articles in the same cluster overlap significantly in overall semantic content. Even so, any arbitrary pair of sentences from different articles within a cluster is unlikely to exhibit a paraphrase relationship: The Phi-X174 genome is short and compact.</Paragraph>
    <Paragraph position="6"> This is a robust new step that allows us to make much larger pieces.</Paragraph>
    <Paragraph position="7"> To isolate just those sentence pairs that represent likely paraphrases without requiring significant string similarity, we exploited a common journalistic convention: the first sentence or two of  A maximum Levenshtein distance of 12 was selected for the purposes of this paper on the basis of experiments with corpora extracted at various edit distances.</Paragraph>
    <Paragraph position="8"> a newspaper article typically summarize its content. One might reasonably expect, therefore, that initial sentences from one article in a cluster will be paraphrases of the initial sentences in other articles in that cluster. This heuristic turns out to be a powerful one, often correctly associating sentences that are very different at the string level: In only 14 days, US researchers have created an artificial bacteria-eating virus from synthetic genes.</Paragraph>
    <Paragraph position="9"> An artificial bacteria-eating virus has been made from synthetic genes in the record time of just two weeks. Also consider the following example, in which related words are obscured by different parts of speech: Chosun Ilbo, one of South Korea's leading newspapers, said North Korea had finished developing a new ballistic missile last year and was planning to deploy it.</Paragraph>
    <Paragraph position="10"> The Chosun Ilbo said development of the new missile, with a range of up to %%number%% kilometres (%%number%% miles), had been completed and deployment was imminent.</Paragraph>
    <Paragraph position="11"> A corpus was produced by extracting the first two sentences of each article, then pairing these across documents within each cluster. We will refer to this collection as the F2 corpus. The combination of the first-two sentences heuristic plus topical article clusters allows us to take advantage of meta-information implicit in our corpus, since clustering exploits lexical information from the entire document, not just the few sentences that are our focus. The assumption that two first sentences are semantically related is thus based in part on linguistic information that is external to the sentences themselves.</Paragraph>
    <Paragraph position="12"> Sometimes, however, the strategy of pairing sentences based on their cluster and position goes astray. This would lead us to posit a paraphrase relationship where there is none: Terence Hope should have spent most of yesterday in hospital performing brain surgery.</Paragraph>
    <Paragraph position="13"> A leading brain surgeon has been suspended from work following a dispute over a bowl of soup.</Paragraph>
    <Paragraph position="14"> To prevent too high an incidence of unrelated sentences, one string-based heuristic filter was found useful: a pair is discarded if the sentences do not share at least 3 words of 4+ characters. This constraint succeeds in filtering out many unrelated pairs, although it can sometimes be too restrictive, excluding completely legitimate paraphrases: There was no chance it would endanger our planet, astronomers said.</Paragraph>
    <Paragraph position="15"> NASA emphasized that there was never danger of a collision.</Paragraph>
    <Paragraph position="16"> An additional filter ensured that the word count of the shorter sentence is at least one-half that of the longer sentence. Given the relatively long sentences in our corpus (average length 18.6 words), these filters allowed us to maintain a degree of semantic relatedness between sentences. Accordingly, the dataset encompasses many paraphrases that would have been excluded under a more stringent edit-distance threshold, for example, the following non-paraphrase pair that contain an element of paraphrase: A staggering %%number%% million Americans have been victims of identity theft in the last five years , according to federal trade commission survey out this week.</Paragraph>
    <Paragraph position="17"> In the last year alone, %%number%% million people have had their identity purloined.</Paragraph>
    <Paragraph position="18"> Nevertheless, even after filtering in these ways, a significant amount of unfiltered noise remains in the F2 corpus, which consisted of 214K sentence pairs. Out of a sample of 448 held-out sentence pairs, 118 (26.3%) were rated by two independent human evaluators as sentence-level paraphrases, while 151 (33.7%) were rated as partial paraphrases. The remaining ~40% were assessed as News article clusters: URLs  Thus, although the F2 data set is nominally larger than the L12 data set, when the noise factor is taken into account, the actual number of full paraphrase sentences in this data set is estimated to be in the region of 56K sentences, with a further estimated 72K sentences containing some paraphrase material that might be a potential source of alignment.</Paragraph>
    <Paragraph position="19"> Some of these relations captured in this data can be complex. The following pair, for example, would be unlikely to pass muster on edit distance grounds, but nonetheless contains an inversion of deep semantic roles, employing different lexical items.</Paragraph>
    <Paragraph position="20"> The Hartford Courant reported %%day%% that Tony Bryant said two friends were the killers.</Paragraph>
    <Paragraph position="21"> A lawyer for Skakel says there is a claim that the murder was carried out by two friends of one of Skakel's school classmates, Tony Bryan.</Paragraph>
    <Paragraph position="22"> The F2 data also retains pairs like the following that involve both high-level semantic alternations and long distance dependencies: Two men who robbed a jeweller's shop to raise funds for the Bali bombings were each jailed for %%number%% years by Indonesian courts today.</Paragraph>
    <Paragraph position="23"> An Indonesian court today sentenced two men to %%number%% years in prison for helping finance last year's terrorist bombings in Bali by robbing a jewelry store.</Paragraph>
    <Paragraph position="24"> These examples do not by any means exhaust the inventory of complex paraphrase types that are commonly encountered in the F2 data. We encounter, among other things, polarity alternations, including those involving long-distance dependencies, and a variety of distributed paraphrases, with alignments spanning widely separated elements.</Paragraph>
    <Section position="1" start_page="4" end_page="5" type="sub_section">
      <SectionTitle>
3.2 Word Alignment Error Rate
</SectionTitle>
      <Paragraph position="0"> An objective scoring function was needed to compare the relative success of the two data collection strategies sketched in 2.1.1 and 2.1.2. Which technique produces more data? Are the types of data significantly different in character or utility? In order to address such questions, we used word Alignment Error Rate (AER), a metric borrowed from the field of statistical machine translation (Och &amp; Ney 2003). AER measures how accurately an automatic algorithm can align words in corpus of parallel sentence pairs, with a human- null This contrasts with 16.7% pairs assessed as unrelated in a 10,000 pair sampling of the L12 data. tagged corpus of alignments serving as the gold standard. Paraphrase data is of course monolingual, but otherwise the task is very similar to the MT alignment problem, posing the same issues with one-to-many, many-to-many, and one/many-tonull word mappings. Our a priori assumption was that the lower the AER for a corpus, the more likely it would be to yield learnable information about paraphrase alternations.</Paragraph>
      <Paragraph position="1"> We closely followed the evaluation standards established in Melamed (2001) and Och &amp; Ney (2000, 2003). Following Och &amp; Ney's methodology, two annotators each created an initial annotation for each dataset, subcategorizing alignments as either SURE (necessary) or POSSIBLE (allowed, but not required). Differences were then highlighted and the annotators were asked to review these cases. Finally we combined the two annotations into a single gold standard in the following manner: if both annotators agreed that an alignment should be SURE, then the alignment was marked as sure in the gold-standard; otherwise the alignment was marked as POSSIBLE.</Paragraph>
      <Paragraph position="2"> To compute Precision, Recall, and Alignment Error Rate (AER) for the twin datasets, we used exactly the formulae listed in Och &amp; Ney (2003).</Paragraph>
      <Paragraph position="3"> Let A be the set of alignments in the comparison, S be the set of SURE alignments in the gold standard, and P be the union of the SURE and POSSIBLE alignments in the gold standard. Then we have:  We held out a set of news clusters from our training data and randomly extracted two sets of sentence pairs for blind evaluation. The first is a set of 250 sentence pairs extracted on the basis of an edit distance of 5 [?] n [?] 20, arbitrarily chosen to allow a range of reasonably divergent candidate pairs. These sentence pairs were checked by an independent human evaluator to ensure that they contained paraphrases before they were tagged for alignments. The second set comprised 116 sentence pairs randomly selected from the set of first-two sentence pairs. These were likewise handvetted by independent human evaluators. After an initial training pass and refinement of the linking specification, interrater agreement measured in terms of AER  was 93.1% for the edit distance test set versus 83.7% for the F2 test set, suggestive of the greater variability in the latter data set.</Paragraph>
    </Section>
    <Section position="2" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
3.3 Data Alignment
</SectionTitle>
      <Paragraph position="0"> Each corpus was used as input to the word alignment algorithms available in Giza++ (Och &amp; Ney 2000). Giza++ is a freely available implementation of IBM Models 1-5 (Brown et al.</Paragraph>
      <Paragraph position="1"> 1993) and the HMM alignment (Vogel et al. 1996), along with various improvements and modifications motivated by experimentation by Och &amp; Ney (2000). Giza++ accepts as input a corpus of sentence pairs and produces as output a Viterbi alignment of that corpus as well as the parameters for the model that produced those alignments.</Paragraph>
      <Paragraph position="2"> While these models have proven effective at the word alignment task (Mihalcea &amp; Pedersen 2003), there are significant practical limitations in their output. Most fundamentally, all alignments have either zero or one connection to each target word. Hence they are unable to produce the many-to-many alignments required to identify correspondences with idioms and other phrasal chunks.</Paragraph>
      <Paragraph position="3"> To mitigate this limitation on final mappings, we follow the approach of Och (2000): we align once in the forward direction and again in the backward direction. These alignments can subsequently be recombined in a variety of ways,  The formula for AER given here and in Och &amp; Ney (2003) is intended to compare an automatic alignment against a gold standard alignment. However, when comparing one human against another, both comparison and reference distinguish between SURE and POSSIBLE links. Because the AER is asymmetric (though each direction differs by less than 5%), we have presented the average of the directional AERs.</Paragraph>
      <Paragraph position="4"> such as union to maximize recall or intersection to maximize precision. Och also documents a method for heuristically recombining the unidirectional alignments intended to balance precision and recall. In our experience, many alignment errors are present in one side but not the other, hence this recombination also serves to filter noise from the process.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="5" end_page="5" type="metho">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> Table 1 shows the results of training translation models on data extracted by both methods and then tested on the blind data. The best overall performance, irrespective of test data type, is achieved by the L12 training set, with an 11.58% overall AER on the 250 sentence pair edit distance test set (20.88% AER for non-identical words).</Paragraph>
    <Paragraph position="1"> The F2 training data is probably too sparse and, with 40% unrelated sentence pairs, too noisy to achieve equally good results; nevertheless the gap between the results for the two training data types is dramatically narrower on the F2 test data. The nearly comparable numbers for the two training data sets, at 13.2% and 14.7% respectively, suggest that the L12 training corpus provides no substantive advantage over the F2 data when tested on the more complex test data. This is particularly striking given the noise inherent in the F2 training data.</Paragraph>
  </Section>
class="xml-element"></Paper>