<?xml version="1.0" standalone="yes"?> <Paper uid="N03-1003"> <Title>Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Algorithm </SectionTitle> <Paragraph position="0"> Overview We first sketch the algorithm's broad outlines. The subsequent subsections provide more detailed descriptions of the individual steps.</Paragraph> <Paragraph position="1"> The major goals of our algorithm are to learn: (a) recurring patterns in the data, such as X (injured/wounded) Y people, Z seriously, where the capital letters represent variables; (b) pairings between such patterns that represent paraphrases, for example, between the pattern X (injured/wounded) Y people, Z of them seriously and the pattern Y were (wounded/hurt) by X, among them Z were in serious condition.</Paragraph> <Paragraph position="2"> Figure 1 illustrates the main stages of our approach. During training, pattern induction is first applied independently to the two datasets making up a pair of comparable corpora. Individual patterns are learned by applying multiple-sequence alignment to clusters of sentences describing approximately similar events; these patterns are represented compactly by lattices (see Figure 3). We then check for lattices from the two different corpora that tend to take the same arguments; these lattice pairs are taken to be paraphrase patterns.</Paragraph> <Paragraph position="3"> [Figure 1 (schematic): corpus 1 and corpus 2 are each processed into lattices, which are then paired across corpora.] [Figure 2: Five sentences (after name substitution) from a cluster of 49, similarities emphasized: (1) A Palestinian suicide bomber blew himself up in a southern city Wednesday, killing two other people and wounding 27. (2) A suicide bomber blew himself up in the settlement of Efrat, on Sunday, killing himself and injuring seven people. (3) A suicide bomber blew himself up in the coastal resort of Netanya on Monday, killing three other people and wounding dozens more. (4) A Palestinian suicide bomber blew himself up in a garden cafe on Saturday, killing 10 people and wounding 54. (5) A suicide bomber blew himself up in the centre of Netanya on Sunday, killing three people as well as himself and injuring 40.]</Paragraph> <Paragraph position="4"> Once training is done, we can generate paraphrases as follows: given the sentence &quot;The surprise bombing injured twenty people, five of them seriously&quot;, we match it to the lattice X (injured/wounded) Y people, Z of them seriously, which can be rewritten as Y were (wounded/hurt) by X, among them Z were in serious condition, and so by substituting arguments we can generate &quot;Twenty were wounded by the surprise bombing, among them five were in serious condition&quot; or &quot;Twenty were hurt by the surprise bombing, among them five were in serious condition&quot;.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Sentence clustering </SectionTitle> <Paragraph position="0"> Our first step is to cluster sentences into groups from which to learn useful patterns; for the multiple-sequence techniques we will use, this means that the sentences within clusters should describe similar events and have similar structure, as in the sentences of Figure 2. This is accomplished by applying hierarchical complete-link clustering to the sentences using a similarity metric based on word n-gram overlap (n = 1, 2, 3, 4).</Paragraph>
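To make the clustering step concrete, the following is a minimal Python sketch of hierarchical complete-link clustering with an n-gram-overlap similarity. It is our illustration, not the authors' implementation: the averaged Dice formula, the merge threshold, and the function names are all assumptions, and sentences are assumed to be token lists in which dates, numbers, and proper names have already been replaced by generic tokens (see below).

    from itertools import combinations

    def ngrams(tokens, n):
        """Set of word n-grams of length n from a token list."""
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def ngram_overlap(s1, s2, max_n=4):
        """Average Dice overlap of word n-grams for n = 1..4; one
        reasonable instantiation of the metric, since the paper does not
        spell out the exact combination."""
        score = 0.0
        for n in range(1, max_n + 1):
            g1, g2 = ngrams(s1, n), ngrams(s2, n)
            if g1 and g2:
                score += 2 * len(g1 & g2) / (len(g1) + len(g2))
        return score / max_n

    def complete_link_clustering(sentences, threshold=0.3, min_size=10):
        """Agglomerative complete-link clustering: a candidate merge is
        scored by the WORST pairwise similarity between the two clusters,
        so every sentence pair inside a cluster stays similar. The
        threshold value is illustrative; clusters with fewer than ten
        sentences are discarded, as in the paper."""
        clusters = [[s] for s in sentences]

        def link(ci, cj):
            return min(ngram_overlap(a, b) for a in ci for b in cj)

        while len(clusters) > 1:
            (i, j), best = (0, 1), -1.0
            for a, b in combinations(range(len(clusters)), 2):
                s = link(clusters[a], clusters[b])
                if s > best:
                    (i, j), best = (a, b), s
            if best < threshold:  # no sufficiently coherent merge remains
                break
            clusters[i].extend(clusters.pop(j))  # j > i, so pop is safe
        return [c for c in clusters if len(c) >= min_size]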
<Paragraph position="1"> The only subtlety is that we do not want mismatches on sentence details (e.g., the location of a raid) to cause sentences describing the same type of occurrence (e.g., a raid) to be separated, as this might yield clusters too fragmented for effective learning to take place. (Moreover, variability in the arguments of the sentences in a cluster is needed for our learning algorithm to succeed; see below.) We therefore first replace all appearances of dates, numbers, and proper names2 with generic tokens. Clusters with fewer than ten sentences are discarded.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Inducing patterns </SectionTitle> <Paragraph position="0"> In order to learn patterns, we first compute a multiple-sequence alignment (MSA) of the sentences in a given cluster. Pairwise MSA takes two sentences and a scoring function giving the similarity between words; it determines the highest-scoring way to perform insertions, deletions, and changes to transform one of the sentences into the other. Pairwise MSA can be extended efficiently to multiple sequences via iterative pairwise alignment, a polynomial-time method commonly used in computational biology (Durbin et al., 1998).3 The results can be represented in an intuitive form via a word lattice (see Figure 3), which compactly represents (n-gram) structural similarities between the cluster's sentences.</Paragraph> <Paragraph position="1"> To transform lattices into generation-suitable patterns requires some understanding of the possible varieties of lattice structures. The most important part of the transformation is to determine which words are actually instances of arguments, and so should be replaced by slots (representing variables). The key intuition is that because the sentences in the cluster represent the same type of event, such as a bombing, but generally refer to different instances of said event (e.g., a bombing in Jerusalem versus in Gaza), areas of large variability in the lattice should correspond to arguments.</Paragraph> <Paragraph position="2"> To quantify this notion of variability, we first formalize its opposite: commonality. We define backbone nodes as those shared by more than 50% of the cluster's sentences. [Footnote 3 (fragment): ... aligning two different words scores -0.5 (parameter values taken from Barzilay and Lee (2002)).] [Figure 3 (slotted-lattice excerpt): start -> Palestinian suicide bomber blew himself up in SLOT1 on SLOT2, killing SLOT3 other people and wounding ...] The choice of 50% is not arbitrary: it can be proved using the pigeonhole principle that our strict-majority criterion imposes a unique linear ordering of the backbone nodes that respects the word ordering within the sentences, thus guaranteeing at least a degree of well-formedness and avoiding the problem of how to order backbone nodes occurring on parallel &quot;branches&quot; of the lattice.</Paragraph> <Paragraph position="3"> Once we have identified the backbone nodes as points of strong commonality, the next step is to identify the regions of variability (or, in lattice terms, many parallel disjoint paths) between them as (probably) corresponding to the arguments of the propositions that the sentences represent. For example, in the top of Figure 3, the words &quot;southern city&quot;, &quot;settlement of NAME&quot;, &quot;coastal resort of NAME&quot;, etc. all correspond to the location of an event and could be replaced by a single slot.</Paragraph>
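The strict-majority criterion and the variability regions lend themselves to a short sketch. The following is our own simplified illustration, assuming the MSA is given as equal-length token rows padded with an explicit gap symbol; the names and the gap symbol are assumptions, not the paper's code.

    GAP = "-"  # gap symbol introduced by the alignment

    def backbone_columns(aligned_rows):
        """Columns of the multiple-sequence alignment whose word is shared
        by more than 50% of the cluster's sentences (the strict-majority
        criterion, which guarantees a unique linear order of backbone
        nodes). aligned_rows: equal-length token lists, gaps included."""
        n_rows = len(aligned_rows)
        backbone = []
        for col in range(len(aligned_rows[0])):
            words = [row[col] for row in aligned_rows if row[col] != GAP]
            if not words:
                continue
            top = max(set(words), key=words.count)
            if 2 * words.count(top) > n_rows:  # strictly more than half
                backbone.append((col, top))
        return backbone

    def candidate_slots(aligned_rows, backbone):
        """Regions between consecutive backbone columns that are realized
        by many distinct fillers (parallel disjoint paths in the lattice)
        are candidate argument slots."""
        cols = [c for c, _ in backbone]
        slots = []
        for left, right in zip(cols, cols[1:]):
            fillers = {tuple(w for w in row[left + 1:right] if w != GAP)
                       for row in aligned_rows}
            if len(fillers) > 1:
                slots.append((left, right, fillers))
        return slots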
<Paragraph position="4"> Figure 3 shows an example of a lattice and the derived slotted lattice; we give the details of the slot-induction process in the Appendix.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Matching lattices </SectionTitle> <Paragraph position="0"> Now, if we were using a parallel corpus, we could employ sentence-alignment information to determine which lattices correspond to paraphrases. Since we do not have this information, we essentially approximate the parallel-corpus situation by correlating information from descriptions of (what we hope are) the same event occurring in the two different corpora.</Paragraph> <Paragraph position="1"> Our method works as follows. Once lattices for each corpus in our comparable-corpus pair are computed, we identify lattice paraphrase pairs, using the idea that paraphrases will tend to take the same values as arguments (Shinyama et al., 2002; Lin and Pantel, 2001). More specifically, we take a pair of lattices from different corpora, look back at the sentence clusters from which the two lattices were derived, and compare the slot values of those cross-corpus sentence pairs that appear in articles written on the same day on the same topic; we pair the lattices if the degree of matching is over a threshold tuned on held-out data. For example, suppose we have two (linearized) lattices slot1 bombed slot2 and slot3 was bombed by slot4 drawn from different corpora. If in the first lattice's sentence cluster we have the sentence &quot;the plane bombed the town&quot;, and in the second lattice's sentence cluster we have a sentence written on the same day reading &quot;the town was bombed by the plane&quot;, then the corresponding lattices may well be paraphrases, where slot1 is identified with slot4 and slot2 with slot3.</Paragraph> <Paragraph position="2"> To compare the set of argument values of two lattices, we simply count their word overlap, giving double weight to proper names and numbers and discarding auxiliaries (we purposely ignore order because paraphrases can consist of word re-orderings).</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Generating paraphrase sentences </SectionTitle> <Paragraph position="0"> Given a sentence to paraphrase, we first need to identify which, if any, of our previously-computed sentence clusters the new sentence belongs most strongly to. We do this by finding the best alignment of the sentence to the existing lattices.4 If a matching lattice is found, we choose one of its comparable-corpus paraphrase lattices to rewrite the sentence, substituting in the argument values of the original sentence. This yields as many paraphrases as there are lattice paths.</Paragraph> </Section> </Section>
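Before moving on to the evaluation, here is a minimal sketch of the argument-overlap test of Section 3.3, under our own assumptions: the double weighting of proper names and numbers and the discarding of auxiliaries follow the description above, but the auxiliary list, the representation of slot values, and the threshold handling are illustrative only.

    AUXILIARIES = {"be", "am", "is", "are", "was", "were", "been", "being",
                   "have", "has", "had", "do", "does", "did"}

    def token_weight(word, proper_names):
        """Double weight for proper names and numbers, zero weight for
        auxiliaries (i.e., they are discarded), one otherwise."""
        if word.lower() in AUXILIARIES:
            return 0
        if word in proper_names or word.replace(",", "").isdigit():
            return 2
        return 1

    def argument_overlap(values_a, values_b, proper_names):
        """Order-insensitive weighted word overlap between the slot values
        two lattices took; order is ignored on purpose, since paraphrases
        can consist of word re-orderings."""
        shared = set(values_a) & set(values_b)
        return sum(token_weight(w, proper_names) for w in shared)

    def lattices_match(slot_value_pairs, proper_names, threshold):
        """slot_value_pairs: slot values from cross-corpus sentence pairs
        drawn from same-day, same-topic articles. The threshold would be
        tuned on held-out data."""
        total = sum(argument_overlap(a, b, proper_names)
                    for a, b in slot_value_pairs)
        return total >= threshold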
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Evaluation </SectionTitle> <Paragraph position="0"> All evaluations involved judgments by native speakers of English who were not familiar with the paraphrasing systems under consideration.</Paragraph> <Paragraph position="1"> We implemented our system on a pair of comparable corpora consisting of articles produced between September 2000 and August 2002 by the Agence France-Presse (AFP) and Reuters news agencies. Given our interest in domain-dependent paraphrasing, we limited attention to articles about violence in the Middle East. From this data (after removing a held-out parameter-training set), we extracted 43 slotted lattices from the AFP corpus and 32 slotted lattices from the Reuters corpus, and found 25 cross-corpus matching pairs; since lattices contain multiple paths, these yielded 6,534 template pairs.5</Paragraph> <Paragraph position="2"> [Figure (sample template pairs): drawn from the 50 &quot;common&quot; template pairs, sorted by judges' perceived validity; for each method, a good, middling, and poor instance is shown. Results are separated by algorithm for clarity, although the blind evaluation presented instances from the two algorithms in random order. Examples include &quot;palestinian suicide bomber blew himself up at X1 in X2 DATE, killing NUM1 and wounding NUM2, police said.&quot; paired with &quot;DATE: NUM1 are killed and around NUM2 injured when suicide bomber blows up his explosive-packed belt at X1 in X2.&quot;; &quot;X1 stormed into X2&quot; paired with &quot;X1 thrusted into X2&quot;; and &quot;X1's candidacy for X2&quot; paired with &quot;X2 expressed X1's condemnation&quot;.]</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Template Quality Evaluation </SectionTitle> <Paragraph position="0"> Before evaluating the quality of the rewritings produced by our templates and lattices, we first tested the quality of a random sample of just the template pairs. In our instructions to the judges, we defined two text units (such as sentences or snippets) to be paraphrases if one of them can generally be substituted for the other without great loss of information (but not necessarily vice versa).6 [Footnote 6 (fragment): ... in initial tests judges found it excruciating to decide on equivalence. Also, in applications such as summarization some information loss is acceptable.]</Paragraph> <Paragraph position="1"> Given a pair of templates produced by a system, the judges marked them as paraphrases if for many instantiations of the templates' variables, the resulting text units were paraphrases. (Several labelled examples were provided to supply further guidance.)</Paragraph> <Paragraph position="2"> To put the evaluation results into context, we wanted to compare against another system, but we are not aware of any previous work creating templates precisely for the task of generating paraphrases. Instead, we made a good-faith effort to adapt the DIRT system (Lin and Pantel, 2001) to the problem, selecting the 6,534 highest-scoring templates it produced when run on our datasets. (The system of Shinyama et al. (2002) was unsuitable for evaluation purposes because their paraphrase extraction component is too tightly coupled to the underlying information extraction system.)</Paragraph>
<Paragraph position="3"> It is important to note some caveats in making this comparison, the most prominent being that DIRT was not designed with sentence-paraphrase generation in mind (its templates are much shorter than ours, which may have affected the evaluators' judgments) and was originally implemented on much larger data sets.7 The point of this evaluation is simply to determine whether another corpus-based paraphrase-focused approach could easily achieve the same performance level.</Paragraph> <Paragraph position="4"> In brief, the DIRT system works as follows. Dependency trees are constructed from parsing a large corpus. Leaf-to-leaf paths are extracted from these dependency trees, with the leaves serving as slots. Then, pairs of paths in which the slots tend to be filled by similar values, where the similarity measure is based on the mutual information between the value and the slot, are deemed to be paraphrases.</Paragraph> <Paragraph position="5"> [Footnote 7: To cope with the corpus-size issue, DIRT was trained on an 84MB corpus of Middle-East news articles, a strict superset of the 9MB we used. Other issues include the fact that DIRT's output needed to be converted into English: it produces paths like &quot;N:of:N &lt;- tide -&gt; N:nn:N&quot;, which we transformed into &quot;Y tide of X&quot; so that its output format would be the same as ours.]</Paragraph> <Paragraph position="6"> We randomly extracted 500 pairs from the two algorithms' output sets. Of these, 100 paraphrases (50 per system) made up a &quot;common&quot; set evaluated by all four judges, allowing us to compute agreement rates; in addition, each judge also evaluated another &quot;individual&quot; set, seen only by him- or herself, consisting of another 100 pairs (50 per system). The &quot;individual&quot; sets allowed us to broaden our sample's coverage of the corpus.8 The pairs were presented in random order, and the judges were not told which system produced a given pair.</Paragraph> <Paragraph position="7"> As Figure 4 shows, our system outperforms the DIRT system, with a consistent performance gap for all the judges of about 38%, although the absolute scores vary (for example, Judge 4 seems lenient). The judges' assessment of correctness was fairly constant between the full 100-instance set and just the 50-instance common set alone.</Paragraph> <Paragraph position="8"> In terms of agreement, the Kappa value (measuring pairwise agreement discounting chance occurrences9) on the common set was 0.54, which corresponds to moderate agreement (Landis and Koch, 1977). Multiway agreement is depicted in Figure 4; there, we see that in 86 of 100 cases, at least three of the judges gave the same correctness assessment, and in 60 cases all four judges concurred.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Evaluation of the generated paraphrases </SectionTitle> <Paragraph position="0"> Finally, we evaluated the quality of the paraphrase sentences generated by our system, thus (indirectly) testing all the system components: pattern selection, paraphrase acquisition, and generation. We are not aware of another system generating sentence-level paraphrases. Therefore, we used as a baseline a simple paraphrasing system that just replaces words with one of their randomly-chosen WordNet synonyms (using the most frequent sense of the word that WordNet listed synonyms for). The number of substitutions was set proportional to the number of words our method replaced in the same sentence.</Paragraph>
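A minimal sketch of such a baseline using NLTK's WordNet interface (wn.synsets returns senses in roughly descending frequency order; synset.lemma_names() lists synonyms). The tokenization, the seeding, and the choice of the first-listed sense that has any synonyms are our assumptions, not the authors' code.

    import random
    from nltk.corpus import wordnet as wn

    def wordnet_baseline(tokens, num_substitutions, seed=0):
        """Replace up to num_substitutions words with a randomly chosen
        WordNet synonym, taken from the word's first-listed (roughly most
        frequent) sense that offers any synonyms."""
        rng = random.Random(seed)
        out = list(tokens)
        positions = list(range(len(out)))
        rng.shuffle(positions)
        done = 0
        for i in positions:
            if done == num_substitutions:
                break
            word = out[i].lower()
            for synset in wn.synsets(word):  # ordered by sense frequency
                synonyms = [l.replace("_", " ")
                            for l in synset.lemma_names()
                            if l.lower() != word]
                if synonyms:
                    out[i] = rng.choice(synonyms)
                    done += 1
                    break
        return out

As described above, num_substitutions would be set proportional to the number of words our method replaced in the same sentence.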
<Paragraph position="1"> The point of this comparison is to check whether simple synonym substitution yields results comparable to those of our algorithm.10 [Footnote 10: We chose not to employ a language model to re-rank either system's output because such an addition would make it hard to isolate the contribution of the paraphrasing component itself.] [Footnote fragment (apparently from the agreement discussion in Section 4.1): ... varying difficulty among instances. For this reason, we actually asked judges to indicate for each instance whether making the validity decision was difficult. However, the judges generally did not agree on difficulty. Post hoc analysis indicates that perception of difficulty depends on each judge's individual &quot;threshold of similarity&quot;, not just the instance itself.]</Paragraph> <Paragraph position="2"> For this experiment, we randomly selected 20 AFP articles about violence in the Middle East published later than the articles in our training corpus. Out of 484 sentences in this set, our system was able to paraphrase 59 (12.2%). (We chose parameters that optimized precision rather than recall on our small held-out set.) We found that after proper name substitution, only seven sentences in the test set appeared in the training set,11 which implies that lattices boost the generalization power of our method significantly: from seven to 59 sentences. [Footnote 11: Since we are doing unsupervised paraphrase acquisition, train-test overlap is allowed.] Interestingly, the coverage of the system varied significantly with article length. For the eight articles of ten or fewer sentences, we paraphrased 60.8% of the sentences per article on average, but for longer articles only 9.3% of the sentences per article on average were paraphrased. Our analysis revealed that long articles tend to include large portions that are unique to the article, such as personal stories of the event participants, which explains why our algorithm had a lower paraphrasing rate for such articles.</Paragraph> <Paragraph position="3"> All 118 instances (59 per system) were presented in random order to two judges, who were asked to indicate whether the meaning had been preserved. Of the paraphrases generated by our system, the two evaluators deemed 81.4% and 78%, respectively, to be valid, whereas for the baseline system, the correctness results were 69.5% and 66.1%, respectively. Agreement according to the Kappa statistic was 0.6. Note that judging full sentences is inherently easier than judging templates, because template comparison requires considering a variety of possible slot values, while sentences are self-contained units.</Paragraph> <Paragraph position="4"> Figure 5 shows two example sentences, one where our MSA-based paraphrase was deemed correct by both judges, and one where both judges deemed the MSA-generated paraphrase incorrect. [Figure 5 caption (fragment): ... (2), and that neither baseline paraphrase was meaning-preserving.] Examination of the results indicates that the two systems make essentially orthogonal types of errors. The baseline system's relatively poor performance supports our claim that whole-sentence paraphrasing is a hard task even when accurate word-level paraphrases are given.</Paragraph> </Section> </Section>
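Both evaluations summarize inter-judge agreement with the Kappa statistic. As a worked illustration only, here is two-judge (Cohen's) kappa for binary valid/invalid labels; the four-judge 0.54 figure above would require a multi-rater variant, so this sketch corresponds at best to the two-judge 0.6 figure.

    def cohens_kappa(labels_a, labels_b):
        """Cohen's kappa for two judges: observed agreement corrected for
        the agreement expected by chance from each judge's marginals."""
        assert len(labels_a) == len(labels_b) and labels_a
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        expected = sum(
            (labels_a.count(lab) / n) * (labels_b.count(lab) / n)
            for lab in set(labels_a) | set(labels_b)
        )
        if expected == 1:  # degenerate case: both judges are constant
            return 1.0
        return (observed - expected) / (1 - expected)

For instance, cohens_kappa([1, 1, 0, 1], [1, 0, 0, 1]) is 0.5: observed agreement 0.75 against chance agreement 0.5.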
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Conclusions </SectionTitle> <Paragraph position="0"> We presented an approach for generating sentence-level paraphrases, a task not addressed previously. Our method learns structurally similar patterns of expression from data and identifies paraphrasing pairs among them using a comparable corpus. A flexible pattern-matching procedure allows us to paraphrase an unseen sentence by matching it to one of the induced patterns. Our approach generates both lexical and structural paraphrases.</Paragraph> <Paragraph position="1"> Another contribution is the induction of MSA lattices from non-parallel data. Lattices have proven advantageous in a number of NLP contexts (Mangu et al., 2000; Bangalore et al., 2002; Barzilay and Lee, 2002; Pang et al., 2003), but were usually produced from (multi-)parallel data, which may not be readily available for many applications. We showed that word lattices can be induced from a type of corpus that can be easily obtained for many domains, broadening the applicability of this useful representation.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> Acknowledgments </SectionTitle> <Paragraph position="0"> We are grateful to many people for helping us in this work. We thank Stuart Allen, Itai Balaban, Hubie Chen, Tom Heyerman, Evelyn Kleinberg, Carl Sable, and Alex Zubatov for acting as judges. Eric Breck helped us with translating the output of the DIRT system. We had numerous very useful conversations with all those mentioned above and with Eli Barzilay, Noemie Elhadad, Jon Kleinberg (who made the &quot;pigeonhole&quot; observation), Mirella Lapata, Smaranda Muresan and Bo Pang. We are very grateful to Dekang Lin for providing us with DIRT's output. We thank the Cornell NLP group, especially Eric Breck, Claire Cardie, Amanda Holland-Minkley, and Bo Pang, for helpful comments on previous drafts. This paper is based upon work supported in part by the National Science Foundation under ITR/IM grant IIS-0081334 and a Sloan Research Fellowship. Any opinions, findings, and conclusions or recommendations expressed above are those of the authors and do not necessarily reflect the views of the National Science Foundation or the Sloan Foundation.</Paragraph> </Section> </Paper>