<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1051">
<Title>Unsupervised Construction of Large Paraphrase Corpora: Exploiting Massively Parallel News Sources</Title>
<Section position="6" start_page="5" end_page="5" type="evalu">
<SectionTitle>5 Analysis/Discussion</SectionTitle>
<Paragraph position="0">To explore some of the differences between the training sets, we hand-examined a random sample of sentence pairs from each corpus type. The most common paraphrase alternations that we observed fell into the following broad categories: * Elaboration: Sentence pairs can differ in total information content, with an added word, phrase, or clause in one sentence that has no counterpart in the other (e.g. the NASDAQ / the tech-heavy NASDAQ).</Paragraph>
<Paragraph position="1">* Phrasal: An entire group of words in one sentence alternates with one word or a phrase in the other. Some are non-compositional idioms (has pulled the plug on / is dropping plans for); others involve different phrasing (electronically / in electronic form, more than a million people / a massive crowd).</Paragraph>
<Paragraph position="2">* Spelling: British/American sources systematically differ in spellings of common words (colour / color); other variants also appear (email / e-mail).</Paragraph>
<Paragraph position="3">* Synonymy: Sentence pairs differ only in one or two words (e.g. charges / accusations), suggesting an editor's hand in modifying a single source sentence.</Paragraph>
<Paragraph position="4">* Anaphora: A full NP in one sentence corresponds to an anaphor in the other (Prime Minister Blair / He). Cases of NP anaphora (ISS / the Atlanta-based security company) are also common in the data, but in quantifying paraphrase types we restricted our attention to the simpler case of pronominal anaphora.</Paragraph>
<Paragraph position="5">* Reordering: Words, phrases, or entire constituents occur in different order in two related sentences, either because of major syntactic differences (e.g. topicalization, voice alternations) or more local pragmatic choices (e.g. adverb or prepositional phrase placement). These categories do not cover all possible alternations between pairs of paraphrased sentences; moreover, categories often overlap in the same sequence of words. It is common, for example, to find instances of clausal Reordering combined with Synonymy.</Paragraph>
<Paragraph position="6">Figure 2 shows a hand-aligned paraphrase pair taken from the F2 data. This pair displays one Spelling alternation (defence / defense), one Reordering (position of the &quot;since&quot; phrase), and one example of Elaboration (&quot;terror attacks&quot; occurs in only one sentence).</Paragraph>
<Paragraph position="7">To quantify the differences between L12 and F2, we randomly chose 100 sentence pairs from each dataset and counted the number of times each phenomenon was encountered. A given sentence pair might exhibit multiple instances of a single phenomenon, such as two phrasal paraphrase changes or two synonym replacements. In this case all instances were counted. Lower-frequency changes that fell outside of the above categories were not tallied: for example, the presence or absence of a definite article (had authority / had the authority) in Figure 2 was ignored. After summing all alternations in each sentence pair, we calculated the average number of occurrences of each paraphrase type in each data set. The results are shown in Table 2.</Paragraph>
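<Paragraph>To make the tallying step concrete, the following sketch computes per-pair averages of each paraphrase type over a sample of annotated sentence pairs. It is a minimal illustration only: the variable names and the per-pair label-list layout are assumptions of this sketch, not the authors' actual annotation tooling.
from collections import Counter

# Hypothetical layout: for each sampled sentence pair, the list of
# paraphrase-type labels observed in that pair; a label repeats when a
# pair exhibits the same phenomenon more than once.
f2_sample = [
    ["Spelling", "Reordering", "Elaboration"],   # e.g. the Figure 2 pair
    ["Synonymy", "Synonymy", "Phrasal"],
    # ... one entry per sampled pair (100 per data set)
]

def average_per_pair(sample):
    """Average number of occurrences of each paraphrase type per sentence pair."""
    totals = Counter()
    for labels in sample:
        totals.update(labels)   # every instance in a pair is counted
    return {ptype: count / len(sample) for ptype, count in totals.items()}

print(average_per_pair(f2_sample))
</Paragraph>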
<Paragraph position="8">Several major differences stand out between the two data sets. First, the F2 data is less parallel, as evidenced by the higher percentage of Elaborations found in those sentence pairs. This loss of parallelism, however, is offset by the greater diversity of paraphrase types encountered in the F2 data.</Paragraph>
<Paragraph position="9">Phrasal alternations are more than 4x more common in F2 than in L12, and Reorderings occur over 20x more frequently. Thus, while string difference methods may produce relatively clean training data, this is achieved at the cost of filtering out common (and interesting) paraphrase relationships.</Paragraph>
</Section>
</Paper>