<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1603"> <Title>Sydney, July 2006. ©2006 Association for Computational Linguistics. Paraphrase Recognition via Dissimilarity Significance Classification</Title> <Section position="3" start_page="0" end_page="18" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The task of sentence-level paraphrase recognition (PR) is to identify whether a set of sentences (typically, a pair) are semantically equivalent. In such a task, equivalence takes on a relaxed meaning, allowing sentence pairs with minor semantic differences to still be considered as paraphrases.</Paragraph> <Paragraph position="1"> PR can be thought of as synonym detection extended for sentences, and it can play an equally important role in natural language applications.</Paragraph> <Paragraph position="2"> As with synonym detection, applications such as summarization can benefit from the recognition and canonicalization of concepts and actions that are shared across multiple documents. Automatic construction of large paraphrase corpora could mine alternative ways to express the same concept, aiding machine translation and natural language generation applications.</Paragraph> <Paragraph position="3"> In our work on sentence-level PR, we have identified two main issues through observation of sample sentences. The first is to identify all discrete information nuggets, or individual semantic content units, shared by the sentences. For a pair of sentences to be deemed a paraphrase, they must share a substantial number of these nuggets. A trivial case is when both sentences are identical, word for word. However, paraphrases often employ different words or syntactic structures to express the same concept. Figure 1 shows two sentence pairs, in which the first pair is a paraphrase while the second is not. The paraphrasing pair (also denoted as the +pp class) uses different words.
Focusing just on the matrix verbs, we note differences between "injured" and "hurt". A paraphrase recognition system should be able to detect such semantic similarities (despite the different syntactic structures). Otherwise, the two sentences could look even less similar than two non-paraphrasing sentences, such as the two in the second pair. Also in the paraphrasing pair, the first sentence includes an extra phrase, "Authorities said". Human annotators tend to regard the pair as a paraphrase despite the presence of this extra information nugget.</Paragraph> <Paragraph position="4"> This leads to the second issue: how to recognize when such extra information is extraneous with respect to the paraphrase judgment. Such paraphrases are common in daily life. In news articles describing the same event, paraphrases are widely used, possibly with extraneous information.</Paragraph> <Paragraph position="5"> We equate PR with solving these two issues, presenting a natural two-phase architecture. In the first phase, the nuggets shared by the sentences are identified by a pairing process. In the second phase, any unpaired nuggets are classified as significant or not (leading to -pp and +pp classifications, respectively). If the sentences do not contain unpaired nuggets, or if all unpaired nuggets are insignificant, then the sentences are considered paraphrases. Experiments on the widely used MSR corpus (Dolan et al., 2004) show favorable results.</Paragraph> <Paragraph position="6"> We first review related work in Section 2. We then present the overall methodology and describe the implemented system in Section 3. Sections 4 and 5 detail the algorithms for the two phases respectively. This is followed by our evaluation and discussion of the results.</Paragraph> </Section> </Paper>