<?xml version="1.0" standalone="yes"?> <Paper uid="P01-1008"> <Title>Extracting Paraphrases from a Parallel Corpus</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The Data </SectionTitle> <Paragraph position="0"> The corpus we use for identification of paraphrases is a collection of multiple English translations from a foreign source text. Specifically, we use literary texts written by foreign authors.</Paragraph> <Paragraph position="1"> Many classical texts have been translated more than once, and these translations are available online. In our experiments we used 5 books, among them Flaubert's Madame Bovary, Andersen's Fairy Tales and Verne's Twenty Thousand Leagues Under the Sea. Some of the translations were created during different time periods and in different countries. In total, our corpus contains 11 translations. (A copyright-free portion of our corpus, 9 translations, is available at http://www.cs.columbia.edu/~regina/par.)</Paragraph> <Paragraph position="2"> At first glance, our corpus seems quite similar to parallel corpora used by researchers in MT, such as the Canadian Hansards. The major distinction lies in the degree of proximity between the translations. Analyzing multiple translations of literary texts, critics (e.g., Wechsler (1998)) have observed that translations "are never identical", and that each translator creates his own interpretation of the text. Clauses such as "adorning his words with puns" and "saying things to make her smile" from the sentences in Figure 1 are examples of distinct translations. Therefore, a complete match between the words of related sentences is impossible. In this respect our corpus resembles noisy and comparable corpora (Veronis, 2000), and this prevents us from using methods developed in the MT community that rely on clean parallel corpora, such as (Brown et al., 1993).</Paragraph> <Paragraph position="3"> Another distinction between our corpus and parallel MT corpora is the irregularity of word matchings: in MT, no words in the source language are kept as is in the target language translation; for example, an English translation of a French source does not contain untranslated French fragments. In contrast, in our corpus the same word is usually used in both translations, and only sometimes are its paraphrases used, which means that word-paraphrase pairs will have lower co-occurrence rates than word-translation pairs in MT. For example, consider occurrences of the word "boy" in two translations of Madame Bovary -- E. Marx-Aveling's translation and Etext's translation. The first text contains 55 occurrences of "boy", which correspond in the second translation to 38 occurrences of "boy" and 17 occurrences of its paraphrases ("son", "young fellow" and "youngster"). This rules out word translation methods based only on word co-occurrence counts.</Paragraph> <Paragraph position="4"> On the other hand, the big advantage of our corpus comes from the fact that parallel translations share many words, which helps the matching process. Below we describe a method of paraphrase extraction that exploits these features of our corpus.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Preprocessing </SectionTitle> <Paragraph position="0"> During the preprocessing stage, we perform sentence alignment. Sentences which are translations of the same source sentence contain a number of identical words, which serve as a strong clue to the matching process. Alignment is performed using dynamic programming (Gale and Church, 1991) with a weight function based on the number of common words in a sentence pair. This simple method achieves good results for our corpus, because, on average, 42% of the words in corresponding sentences are identical. Alignment produces 44,562 pairs of sentences with 1,798,526 words. To evaluate the accuracy of the alignment process, we analyzed 127 sentence pairs from the algorithm's output; 120 (94.5%) alignments were identified as correct.</Paragraph>
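To make the alignment step concrete, the sketch below implements a dynamic-programming aligner driven by a common-word weight. The paper specifies only the weight function, so the 1-1/skip move set, the penalty value, and all names here are our assumptions rather than the authors' implementation.

```python
# Minimal sketch of sentence alignment by dynamic programming, scored by
# the number of words two sentences share. Only 1-1 matches and skips are
# modeled; the move set and skip penalty are our assumptions.

def common_words(s1, s2):
    """Number of word types shared by two tokenized sentences."""
    return len({w.lower() for w in s1} & {w.lower() for w in s2})

def align(text1, text2, skip_penalty=1.0):
    """Align two lists of tokenized sentences; return matched index pairs."""
    n, m = len(text1), len(text2)
    NEG = float("-inf")
    score = [[NEG] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if score[i][j] == NEG:
                continue
            if i < n and j < m:  # match text1[i] with text2[j]
                s = score[i][j] + common_words(text1[i], text2[j])
                if s > score[i + 1][j + 1]:
                    score[i + 1][j + 1] = s
                    back[i + 1][j + 1] = ("match", i, j)
            if i < n and score[i][j] - skip_penalty > score[i + 1][j]:
                score[i + 1][j] = score[i][j] - skip_penalty
                back[i + 1][j] = ("skip1", i, j)
            if j < m and score[i][j] - skip_penalty > score[i][j + 1]:
                score[i][j + 1] = score[i][j] - skip_penalty
                back[i][j + 1] = ("skip2", i, j)
    pairs, i, j = [], n, m  # backtrack from the full alignment
    while back[i][j] is not None:
        move, pi, pj = back[i][j]
        if move == "match":
            pairs.append((pi, pj))
        i, j = pi, pj
    return pairs[::-1]
```

Gale and Church (1991) originally weight alignments by sentence length; substituting the common-word count is the adaptation described above, and it works here precisely because corresponding translations share so many identical words.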
<Paragraph position="1"> We then use a part-of-speech tagger and chunker (Mikheev, 1997) to identify noun and verb phrases in the sentences. These phrases become the atomic units of the algorithm. We also record for each token its derivational root, using the CELEX (Baayen et al., 1993) database.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Method for Paraphrase Extraction </SectionTitle> <Paragraph position="0"> Given the aforementioned differences between translations, our method builds on similarity in the local context, rather than on global alignment.</Paragraph> <Paragraph position="1"> Consider the two sentences in Figure 2.</Paragraph> <Paragraph position="2"> Figure 2: "And finally, dazzlingly white, it shone high above them in the empty ?." / "It appeared white and dazzling in the empty ?."</Paragraph> <Paragraph position="3"> Analyzing the contexts surrounding the "?"-marked blanks in both sentences, one expects that they should have the same meaning, because they have the same premodifier "empty" and relate to the same preposition "in" (in fact, the first "?" stands for "sky", and the second for "heavens"). Generalizing from this example, we hypothesize that if the contexts surrounding two phrases look similar enough, then these two phrases are likely to be paraphrases. The definition of the context depends on how similar the translations are. Once we know which contexts are good paraphrase predictors, we can extract paraphrase patterns from our corpus.</Paragraph> <Paragraph position="4"> Examples of such contexts are verb-object relations and noun-modifier relations, which have traditionally been used in word similarity tasks over non-parallel corpora (Pereira et al., 1993; Hatzivassiloglou and McKeown, 1993). In our case, however, more indirect relations can also be clues for paraphrasing, because we know a priori that the input sentences convey the same information. For example, in the sentences from Figure 3, the verbs "ringing" and "sounding" do not share identical subject nouns, but the modifier of both subjects, "Evening", is identical. Can we conclude that identical modifiers of the subject imply verb similarity? To address this question, we need a way to identify contexts that are good predictors for paraphrasing in a corpus.</Paragraph> <Paragraph position="5"> Figure 3: "People said 'The Evening Noise is sounding, the sun is setting.'" / "'The evening bell is ringing,' people used to say."</Paragraph> <Paragraph position="6"> To find "good" contexts, we can analyze all contexts surrounding identical words in the pairs of aligned sentences, and use these contexts to learn new paraphrases. This provides a basis for a bootstrapping mechanism.</Paragraph>
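The sketch below makes this seed step concrete: given one aligned sentence pair as parallel token and POS lists (our assumed data layout), it records the POS contexts around every word the two sentences share. All names are ours, not the paper's.

```python
# Sketch: harvest POS contexts around identical words in an aligned pair.
# The token/tag list layout and all names are our assumptions.

def seed_contexts(toks1, tags1, toks2, tags2, n=3):
    """Return (word, side, left_tags, right_tags) for words in both sentences."""
    shared = {t.lower() for t in toks1} & {t.lower() for t in toks2}
    out = []
    for side, (toks, tags) in enumerate(((toks1, tags1), (toks2, tags2)), 1):
        for i, tok in enumerate(toks):
            if tok.lower() in shared:
                left = tuple(tags[max(0, i - n):i])    # up to n tags to the left
                right = tuple(tags[i + 1:i + 1 + n])   # up to n tags to the right
                out.append((tok.lower(), side, left, right))
    return out
```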
<Paragraph position="7"> Starting with identical words in aligned sentences as a seed, we can incrementally learn the "good" contexts, and in turn use them to learn new paraphrases. Identical words play two roles in this process: first, they are used to learn context rules; second, they are used in the application of these rules, because the rules contain information about the equality of words in context.</Paragraph> <Paragraph position="8"> This method of co-training has previously been applied to a variety of natural language tasks, such as word sense disambiguation (Yarowsky, 1995), lexicon construction for information extraction (Riloff and Jones, 1999), and named entity classification (Collins and Singer, 1999). In our case, the co-training process creates a binary classifier, which predicts whether a given pair of phrases is a paraphrase or not.</Paragraph> <Paragraph position="9"> Our model is based on the DLCoTrain algorithm proposed by Collins and Singer (1999), which applies a co-training procedure to decision list classifiers for two independent sets of features. In our case, one set of features describes the paraphrase pair itself, and the other set corresponds to the contexts in which paraphrases occur. These features and their computation are described below.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Feature Extraction </SectionTitle> <Paragraph position="0"> Our paraphrase features include lexical and syntactic descriptions of the paraphrase pair. The lexical feature set consists of the sequence of tokens for each phrase in the paraphrase pair; the syntactic feature set consists of a sequence of part-of-speech tags in which equal words and words with the same root are marked. For example, the value of the syntactic feature for the pair ("the vast chimney", "the chimney") is ("DT_1 JJ NN_2", "DT_1 NN_2"), where indices indicate word equalities. We believe this feature can be useful for two reasons: first, we expect that some syntactic categories cannot be paraphrased by another syntactic category; for example, a determiner is unlikely to be a paraphrase of a verb. Second, this description is able to capture regularities in phrase-level paraphrasing. In fact, a similar representation was used by Jacquemin et al. (1997) to describe term variations.</Paragraph> <Paragraph position="1"> The contextual feature is a combination of the left and right syntactic contexts surrounding actual known paraphrases. There are a number of candidate context representations: lexical n-grams, POS n-grams and parse tree fragments. The natural choice is a parse tree; however, existing parsers perform poorly in our domain. (To the best of our knowledge, all existing statistical parsers are trained on the WSJ or similar corpora; in the experiments we conducted, their performance degraded significantly on our corpus of literary texts.) Part-of-speech tags provide the required level of abstraction, and can be accurately computed for our data. The left (right) context is a sequence of part-of-speech tags of n words occurring to the left (right) of the paraphrase. As in the case of the syntactic paraphrase features, tags of identical words are marked. For example, when n = 2, the contextual feature for the paraphrase pair ("comfort", "console") from the Figure 1 sentences is left_1 = "VB_1 TO_2" ("tried to"), left_2 = "VB_1 TO_2" ("tried to"), right_1 = "PRP$_3 ,_4" ("her ,"), right_2 = "PRP$_3 ,_4" ("her ,").</Paragraph>
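A sketch of the equality-marked POS representation just described follows; simple lowercasing stands in for the CELEX derivational-root lookup, which is a simplification of what the paper does.

```python
# Sketch of the syntactic feature of Section 5.1: POS sequences in which
# words shared by the two phrases are co-indexed. Lowercasing stands in
# for the CELEX root lookup used in the paper.

def mark_equalities(tokens1, tags1, tokens2, tags2):
    roots1 = [t.lower() for t in tokens1]
    roots2 = [t.lower() for t in tokens2]
    shared = set(roots1) & set(roots2)
    index, k = {}, 0
    for root in roots1:                  # number shared words by first mention
        if root in shared and root not in index:
            k += 1
            index[root] = k

    def annotate(roots, tags):
        return " ".join(
            f"{tag}_{index[root]}" if root in shared else tag
            for root, tag in zip(roots, tags)
        )

    return annotate(roots1, tags1), annotate(roots2, tags2)

# mark_equalities(["the", "vast", "chimney"], ["DT", "JJ", "NN"],
#                 ["the", "chimney"], ["DT", "NN"])
# -> ("DT_1 JJ NN_2", "DT_1 NN_2")
```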
<Paragraph position="2"> In the next section, we describe how the classifiers for the contextual and paraphrasing features are co-trained.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 The co-training algorithm </SectionTitle> <Paragraph position="0"> Our co-training algorithm has three stages: initialization, training of the contextual classifier, and training of the paraphrasing classifier.</Paragraph> <Paragraph position="1"> Initialization. Words which appear in both sentences of an aligned pair are used to create the initial "seed" rules. Using identical words, we create a set of positive paraphrasing examples, such as word_1=tried, word_2=tried. However, training the classifier demands negative examples as well; in our case it requires pairs of words in aligned sentences which are not paraphrases of each other. To find negative examples, we match identical words in the alignment against all different words in the aligned sentence, assuming that identical words can match only each other, and not any other word in the aligned sentences. For example, "tried" from the first sentence in Figure 1 does not correspond to any word in the second sentence other than "tried". Based on this observation, we can derive negative examples such as word_1=tried, word_2=Emma and word_1=tried, word_2=console. Given a pair of identical words from two sentences of lengths n and m, the algorithm produces one positive example and (n - 1) + (m - 1) negative examples.</Paragraph> <Paragraph position="2"> Training of the contextual classifier. Using this initial seed, we record contexts around positive and negative paraphrasing examples. From all the extracted contexts we must identify the ones which are strong predictors of their category. Following (Collins and Singer, 1999), filtering is based on the strength of the context and its frequency. The strength of a positive context x is defined as count(x, +) / count(x), where count(x, +) is the number of times context x surrounds positive examples (paraphrase pairs) and count(x) is the frequency of the context x. The strength of a negative context is defined in a symmetrical manner. For the positive and the negative categories we select the k rules (k = 10 in our experiments) with the highest frequency and a strength above the predefined threshold of 95%. Examples of selected context rules are shown in Figure 4.</Paragraph> <Paragraph position="3"> The parameter of the contextual classifier is the context length. In our experiments we found that a maximal context length of three produces the best results. We also observed that for some rules a shorter context works better. Therefore, when recording contexts around positive and negative examples, we record all contexts with length smaller than or equal to the maximal length.</Paragraph>
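The filtering step above reduces to a few lines of code. In the sketch below, the counts layout (context mapped to positive and total counts) is our assumption; negative rules would be selected symmetrically using (total - positive) / total.

```python
# Sketch of contextual-rule selection: keep the k most frequent contexts
# whose strength count(x, +) / count(x) exceeds the threshold. The counts
# layout (context -> (positive, total)) is our assumption.

def select_context_rules(counts, k=10, threshold=0.95):
    strong = [
        (ctx, tot) for ctx, (pos, tot) in counts.items()
        if tot > 0 and pos / tot > threshold
    ]
    strong.sort(key=lambda item: item[1], reverse=True)  # most frequent first
    return [ctx for ctx, _ in strong[:k]]
```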
<Paragraph position="4"> Because our corpus consists of translations of several books, created by different translators, we expect the similarity between translations to vary from one book to another. This implies that contextual rules should be specific to a particular pair of translations. Therefore, we train the contextual classifier for each pair of translations separately.</Paragraph> <Paragraph position="5"> Training of the paraphrasing classifier. Context rules extracted in the previous stage are then applied to the corpus to derive a new set of positive and negative paraphrasing examples.</Paragraph> <Paragraph position="6"> A rule is applied by searching sentence pairs for subsequences which match the left and right parts of the contextual rule and are less than t tokens apart, for a small fixed t. For example, applying the first rule from Figure 4 to the sentences from Figure 1 yields the paraphrasing pair ("comfort", "console"). Note that in the original seed set, the left and right contexts were separated by exactly one token; this stretch in rule application allows us to extract multi-word paraphrases.</Paragraph> <Paragraph position="7"> For each extracted example, paraphrasing rules are recorded and filtered in the same manner as contextual rules. Examples of lexical and syntactic paraphrasing rules are shown in Figure 5 and Figure 6. After the extracted lexical and syntactic paraphrases are applied to the corpus, the contextual classifier is retrained. New paraphrases not only add more positive and negative instances to the contextual classifier, but also revise contextual rules for known instances based on the new paraphrase information.</Paragraph> <Paragraph position="8"> The iterative process terminates when no new paraphrases are discovered or the number of iterations exceeds a predefined threshold.</Paragraph>
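To make the rule-application step described above concrete, here is a minimal sketch of the matching inside a single sentence. Equality-index checking and the pairing of matched spans across the two aligned sentences are omitted, and the bound t, like all names here, is our assumption.

```python
# Sketch of applying one contextual rule inside a sentence: find token
# spans whose surrounding POS tags match the rule's left and right parts
# and which are at most t tokens long. Equality-index matching is omitted.

def apply_rule(tags, tokens, left, right, t=3):
    L, R = len(left), len(right)
    matches = []
    for i in range(len(tags)):
        if tuple(tags[i:i + L]) != tuple(left):
            continue
        for gap in range(1, t + 1):        # span of 1..t tokens between contexts
            j = i + L + gap
            if j + R <= len(tags) and tuple(tags[j:j + R]) == tuple(right):
                matches.append(tokens[i + L:j])
    return matches
```

A candidate paraphrase pair is then recorded when the same rule matches in both sentences of an aligned pair.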
</Section> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 The results </SectionTitle> <Paragraph position="0"> Our algorithm produced 9483 pairs of lexical paraphrases and 25 morpho-syntactic rules. To evaluate the quality of the produced paraphrases, we picked at random 500 paraphrasing pairs from the lexical paraphrases produced by our algorithm.</Paragraph> <Paragraph position="1"> These pairs were used as test data and also to evaluate whether humans agree on paraphrasing judgments. The judges were given a page of guidelines defining paraphrase as "approximate conceptual equivalence". The main dilemma in designing the evaluation is whether to include the context: should the human judge see only a paraphrase pair, or should a pair of sentences containing these paraphrases also be given? In a similar MT task -- evaluation of word-to-word translation -- context is usually included (Melamed, 2001). Although paraphrasing is considered to be context dependent, there is no agreement on the extent. To evaluate the influence of context on paraphrasing judgments, we performed two experiments -- with and without context. First, the human judge is given a paraphrase pair without context; after the judge has entered his answer, he is given the same pair with its surrounding context. Each context was evaluated by two judges (other than the authors).</Paragraph> <Paragraph position="2"> The agreement was measured using the Kappa coefficient (Siegel and Castellan, 1988). Complete agreement between judges would correspond to K = 1; if there is no agreement among judges, then K = 0.</Paragraph> <Paragraph position="3"> The judges' agreement on the paraphrasing judgment without context was K = 0.68, which is substantial agreement (Landis and Koch, 1977). The first judge found 439 (87.8%) pairs to be correct paraphrases, and the second judge 426 (85.2%). Judgments with context have even higher agreement (K = 0.75), and the judges identified 459 (91.8%) and 457 (91.4%) pairs as correct paraphrases.</Paragraph> <Paragraph position="4"> The recall of our method is a more problematic issue. The algorithm can identify paraphrasing relations only between words which occur in our corpus, which of course does not cover all English tokens. Furthermore, direct comparison with an electronic thesaurus like WordNet is impossible, because it is not known a priori which lexical relations in WordNet can form paraphrases. Thus, we cannot evaluate recall. Instead, we hand-evaluated coverage by asking a human judge to extract paraphrases from 50 sentences and then counting how many of these paraphrases were predicted by our algorithm. Of the 70 paraphrases extracted by the human judge, 48 (69%) were identified as paraphrases by our algorithm.</Paragraph> <Paragraph position="5"> In addition to evaluating our system output through precision and coverage, we also compared our results with two other methods. The first of these was a machine translation technique for deriving bilingual lexicons (Melamed, 2001), including detection of non-compositional compounds. (Equivalences that were identical on both sides were removed from the output.) We did this evaluation on 60% of the full dataset; this is the portion of the data which is publicly available. Our system produced 6,826 word pairs from this data, and Melamed provided the top 6,826 word pairs resulting from his system on this data. We randomly extracted 500 pairs from each system's output. Of the 500 pairs produced by our system, 354 (70.8%) were single-word pairs and 146 (29.2%) were multi-word paraphrases, while the majority of pairs produced by Melamed's system were single-word pairs (90%). We mixed this output and gave the resulting, randomly ordered 1000 pairs to six evaluators, all of whom were native speakers. Each evaluator provided judgments on 500 pairs without context. Precision for our system was 71.6%, and for Melamed's it was 52.7%. This increased precision is a clear advantage of our approach and shows that machine translation techniques cannot be used without modification for this task, particularly for producing multi-word paraphrases.</Paragraph> <Paragraph position="6"> Three caveats should be noted: first, Melamed's system was run without changes for this new task of paraphrase extraction, and it does not use chunk segmentation; second, his system ran for three days of computation, and its results might improve with more running time, since it makes incremental improvements on subsequent rounds; finally, the agreement between human judges was lower than in our previous experiments. We are currently exploring whether the information produced by the two different systems may be combined to improve the performance of either system alone.</Paragraph> <Paragraph position="7"> Another view on the extracted paraphrases can be derived by comparing them with the WordNet thesaurus.</Paragraph>
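The sketch below shows one way such a comparison can be automated, using NLTK's WordNet interface -- a tooling choice of ours; the paper predates NLTK and does not say how the relations were checked. Restricting siblings to a shared direct parent is likewise our simplification.

```python
# Sketch: classify a word pair by its WordNet relation, checking all sense
# combinations. Uses NLTK's WordNet interface (our tooling choice).

from nltk.corpus import wordnet as wn

def wordnet_relation(word1, word2):
    syns1, syns2 = wn.synsets(word1), wn.synsets(word2)
    if any(s1 == s2 for s1 in syns1 for s2 in syns2):
        return "synonyms"            # some sense pair shares a synset
    for s1 in syns1:
        for s2 in syns2:
            if s1 in s2.hypernyms() or s2 in s1.hypernyms():
                return "hyperonyms"
    for s1 in syns1:
        for s2 in syns2:
            if set(s1.hypernyms()) & set(s2.hypernyms()):
                return "siblings"    # shared direct parent in the hypernym tree
    return "unrelated"
```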
<Paragraph position="8"> This comparison provides us with quantitative evidence on the types of lexical relations people use to create paraphrases. We selected 112 paraphrasing pairs which occurred at least 20 times in our corpus and whose component words appear in WordNet. The frequency cutoff of 20 was chosen to ensure that the identified pairs are general, rather than idiosyncratic or tailored to a single context. Examples of paraphrases and their WordNet relations are shown in Figure 7. Only 40 (35%) of the paraphrases are synonyms, 36 (32%) are hyperonyms, 20 (18%) are siblings in the hyperonym tree, 11 (10%) are unrelated, and the remaining 5% are covered by other relations. These figures quantitatively validate our intuition that synonymy is not the only source of paraphrasing. One practical implication is that relying exclusively on synonymy relations to recognize paraphrasing limits system performance.</Paragraph> <Paragraph position="9"> Figure 7: Synonyms: (rise, stand up), (hot, warm); Hyperonyms: (landlady, hostess), (reply, say); Siblings: (city, town), (pine, fir); Unrelated: (sick, tired), (next, then).</Paragraph> </Section> </Paper>