<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1603"> <Title>Sydney, July 2006. c(c)2006 Association for Computational Linguistics Paraphrase Recognition via Dissimilarity Signi cance Classi cation</Title> <Section position="4" start_page="18" end_page="19" type="relat"> <SectionTitle> 2 Related Work </SectionTitle> <Paragraph position="0"> Possibly the simplest approach to PR is an information retrieval (IR) based bag-of-words strategy. This strategy calculates a cosine similarity score for the given sentence set, and if the similarity exceeds a threshold (either empirically determined or learned from supervised training data), the sentences are paraphrases. PR systems that can be broadly categorized as IR-based include (Corley and Mihalcea, 2005; Brockett and Dolan, 2005). In the former work, the authors de ned a directional similarity formula re ecting the semantic similarity of one text with respect to another. A word contributes to the directional similarity only when its counterpart has been identi ed in the opposing sentence. The associated word similarity scores, weighted by the word's speci city (represented as inverted document frequency, idf ), sum to make up the directional similarity. The mean of both directions is the overall similarity of the pair. Brockett and Dolan (2005) represented sentence pairs as a feature vector, including features (among others) for sentence length, edit distance, number of shared words, morphologically similar word pairs, synonym pairs (as suggested by WordNet and a semi-automatically constructed thesaurus). A support vector machine is then trained to learn the f+pp, ppg classi er.</Paragraph> <Paragraph position="1"> Strategies based on bags of words largely ignore the semantic interactions between words.</Paragraph> <Paragraph position="2"> Weeds et al. (2005) addressed this problem by utilizing parses for PR. Their system for phrasal paraphrases equates paraphrasing as distributional similarity of the partial sub-parses of a candidate text. Wu (2005)'s approach relies on the generative framework of Inversion Transduction Grammar (ITG) to measure how similar two sentences arrange their words based on edit distance.</Paragraph> <Paragraph position="3"> Barzilay and Lee (2003) proposed to apply multiple-sequence alignment (MSA) for traditional, sentence-level PR. Given multiple articles on a certain type of event, sentence clusters are rst generated. Sentences within the same cluster, presumably similar in structure and content, are then used to construct a lattice with backbone nodes corresponding to words shared by the majority and slots corresponding to different realization of arguments. If sentences from different clusters have shared arguments, the associated lattices are claimed to be paraphrase. Likewise, Shinyama et al. (2002) extracted paraphrases from similar news articles, but use shared named entities as an indication of paraphrasing. It should be noted that the latter two approaches are geared towards acquiring paraphrases rather than detecting them, and as such have the disadvantage of requiring a certain level of repetition among candidates for paraphrases to be recognized.</Paragraph> <Paragraph position="4"> All past approaches invariably aim at a proper similarity measure that accounts for all of the words in the sentences in order to make a judgment for PR. This is suitable for PR where input sentences are precisely equivalent semantically. 
<Paragraph position="4"> All past approaches invariably aim at a proper similarity measure that accounts for all of the words in the sentences in order to make a PR judgment. This is suitable when the input sentences are precisely equivalent semantically. However, for many people the notion of paraphrase also covers cases in which minor or irrelevant information is added or omitted in candidate sentences, as observed in the earlier example. Such extraneous content should not be a barrier to PR if the main concepts are shared by the sentences. Approaches that focus only on the similarity of shared content may fail when the (human) criteria for PR include whether the unmatched content is significant or not. Correctly addressing this problem should increase accuracy.</Paragraph>
<Paragraph position="5"> In addition, if extraneous portions of sentences can be identified, their confounding influence on the sentence similarity judgment can be removed, leading to more accurate modeling of semantic similarity for both recognition and acquisition.</Paragraph>
</Section>
</Paper>