<?xml version="1.0" standalone="yes"?> <Paper uid="P01-1008"> <Title>Extracting Paraphrases from a Parallel Corpus</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Paraphrases are alternative ways to convey the same information. A method for the automatic acquisition of paraphrases has both practical and linguistic interest. From a practical point of view, diversity in expression presents a major challenge for many NLP applications. In multidocument summarization, identification of paraphrasing is required to find repetitive information in the input documents. In generation, paraphrasing is employed to create more varied and fluent text.</Paragraph> <Paragraph position="1"> Most current applications use manually collected paraphrases tailored to a specific application, or utilize existing lexical resources such as WordNet (Miller et al., 1990) to identify paraphrases. However, the process of manually collecting paraphrases is time-consuming, and moreover, the collection is not reusable in other applications. Existing resources include only lexical paraphrases; they do not include phrasal or syntactically based paraphrases.</Paragraph> <Paragraph position="2"> From a linguistic point of view, questions concern the operative definition of paraphrases: what types of lexical relations and syntactic mechanisms can produce paraphrases? Many linguists (Halliday, 1985; de Beaugrande and Dressler, 1981) agree that paraphrases retain &quot;approximate conceptual equivalence&quot;, and are not limited to synonymy relations. But the extent of interchangeability between phrases which form paraphrases is an open question (Dras, 1999). A corpus-based approach can provide insights on this question by revealing paraphrases that people use.</Paragraph> <Paragraph position="3"> This paper presents a corpus-based method for automatic extraction of paraphrases. 
We use a large collection of multiple parallel English translations of novels. (Footnote 1: Foreign sources are not used in our experiment.) This corpus provides many instances of paraphrasing, because translations preserve the meaning of the original source, but may use different words to convey the meaning. An example of parallel translations is shown in Figure 1. It contains two pairs of paraphrases: (&quot;burst into tears&quot;, &quot;cried&quot;) and (&quot;comfort&quot;, &quot;console&quot;).</Paragraph> <Paragraph position="4"> Emma burst into tears and he tried to comfort her, saying things to make her smile.</Paragraph> <Paragraph position="5"> Emma cried, and he tried to console her, adorning his words with puns.</Paragraph> <Paragraph position="6"> Our method for paraphrase extraction builds upon methodology developed in Machine Translation (MT). In MT, pairs of translated sentences from a bilingual corpus are aligned, and occurrence patterns of words in two languages in the text are extracted and matched using correlation measures. However, our parallel corpus is far from the clean parallel corpora used in MT.</Paragraph> <Paragraph position="7"> The rendition of a literary text into another language includes not only the translation, but also a restructuring of the translation to fit the appropriate literary style. This process introduces differences in the translations which are an intrinsic part of the creative process. This results in greater differences across translations than the differences in typical MT parallel corpora, such as the Canadian Hansards. We will return to this point later in Section 3.</Paragraph> <Paragraph position="8"> Based on the specifics of our corpus, we developed an unsupervised learning algorithm for paraphrase extraction. 
During the preprocessing stage, the corresponding sentences are aligned.</Paragraph> <Paragraph position="9"> We base our method for paraphrase extraction on the assumption that phrases in aligned sentences which appear in similar contexts are paraphrases. To automatically infer which contexts are good predictors of paraphrases, contexts surrounding identical words in aligned sentences are extracted and filtered according to their predictive power. Then, these contexts are used to extract new paraphrases. In addition to learning lexical paraphrases, the method also learns syntactic paraphrases, by generalizing syntactic patterns of the extracted paraphrases. Extracted paraphrases are then applied to the corpus, and used to learn new context rules. This iterative algorithm continues until no new paraphrases are discovered.</Paragraph> <Paragraph position="10"> A novel feature of our approach is the ability to extract multiple kinds of paraphrases: Identification of lexical paraphrases. In contrast to earlier work on similarity, our approach allows identification of multi-word paraphrases, in addition to single words, a challenging issue for corpus-based techniques.</Paragraph> <Paragraph position="11"> Extraction of morpho-syntactic paraphrasing rules. Our approach yields a set of paraphrasing patterns by extrapolating the syntactic and morphological structure of extracted paraphrases.</Paragraph> <Paragraph position="12"> This process relies on morphological information and part-of-speech tagging. Many of the rules identified by the algorithm match those that have been described as productive paraphrases in the linguistic literature.</Paragraph> <Paragraph position="13"> In the following sections, we provide an overview of existing work on paraphrasing, then describe the data used in this work, and detail our paraphrase extraction technique. 
We present results of our evaluation, and conclude with a discussion of our results.</Paragraph> </Section> </Paper>