<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1057"> <Title>ParaEval: Using Paraphrases to Evaluate Summaries Automatically</Title> <Section position="4" start_page="447" end_page="448" type="metho"> <SectionTitle> 3 Motivation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="447" end_page="448" type="sub_section"> <SectionTitle> 3.1 Paraphrase Matching </SectionTitle> <Paragraph position="0"> An important difference that separates current manual evaluation methods from their automatic counterparts is that semantic matching of content units is performed by human summary assessors.</Paragraph> <Paragraph position="1"> An essential part of this semantic matching involves paraphrase matching, i.e., determining whether phrases worded differently carry the same semantic information. This paraphrase matching process is observed in the Pyramid annotation procedure shown in (Nenkova and Passonneau, 2004) over three summary sets (10 summaries each). In the example shown in Figure 1 (reproduced from Pyramid results), each of the 10 phrases (numbered 1 to 10) extracted from summary sentences carries the same semantic content as the overall summary content unit labeled SCU1 (&quot;the crime in question was the Lockerbie {Scotland} bombing&quot;). Each extracted phrase is identified as a summary content unit (SCU). In our work on building an automatic evaluation procedure that enables paraphrase matching, we aim to automatically identify these 10 phrases as paraphrases of one another.</Paragraph> </Section> <Section position="2" start_page="448" end_page="448" type="sub_section"> <SectionTitle> 3.2 Synonymy Relations </SectionTitle> <Paragraph position="0"> Synonym matching and paraphrase matching are often mentioned in the same context in discussions of extending current automated summarization evaluation methods to incorporate the matching of semantic units. While evaluating automatically extracted paraphrases via WordNet (Miller et al., 1990), Barzilay and McKeown (2001) quantitatively validated that synonymy is not the only source of paraphrasing. We envisage that this claim also holds for summary comparisons.</Paragraph> <Paragraph position="1"> From an in-depth analysis of the manually created SCUs of the DUC2003 summary set D30042 (Nenkova and Passonneau, 2004), we find that in 54.48% of the 1746 cases where a non-stop word from one SCU did not match its supposedly human-aligned pairing SCUs, some level of paraphrase matching support is needed. For example, in the first two extracted SCUs (labeled as 1 and 2) in Figure 1, &quot;for the Lockerbie bombing&quot; and &quot;for blowing up ... over Lockerbie, Scotland&quot;, no non-stop word other than the word &quot;Lockerbie&quot; occurs in both phrases. But these two phrases were judged to carry the same semantic meaning, because human annotators consider the word &quot;bombing&quot; and the phrase &quot;blowing up&quot; to refer to the same action, namely the one associated with &quot;explosion.&quot; However, &quot;bombing&quot; and &quot;blowing up&quot; cannot be matched through synonymy relations using WordNet, since one is a noun and the other is a verb phrase (if tagged within context). 
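To make the limitation concrete, the following sketch checks whether the two expressions can be linked through shared synsets or hypernym chains. It uses NLTK's WordNet interface and a pair of probe words for the extended search, both of which are assumptions of this illustration rather than part of the paper, and it already covers the extended search discussed in the next sentence.

from nltk.corpus import wordnet as wn  # assumes NLTK with the WordNet corpus installed

def related_via_wordnet(word_a, pos_a, word_b, pos_b):
    """True if the two words share a synset, or if a synset of one appears
    in the hypernym chain of a synset of the other."""
    syns_a = set(wn.synsets(word_a, pos=pos_a))
    syns_b = set(wn.synsets(word_b, pos=pos_b))
    if syns_a & syns_b:  # direct synonymy
        return True
    hyper_a = {h for s in syns_a for h in s.closure(lambda x: x.hypernyms())}
    hyper_b = {h for s in syns_b for h in s.closure(lambda x: x.hypernyms())}
    return bool(syns_a & hyper_b) or bool(syns_b & hyper_a)  # hypernymy

# Direct check: "bombing" is tagged as a noun, "blowing up" as a verb in context.
print(related_via_wordnet("bombing", wn.NOUN, "blow_up", wn.VERB))    # False
# Extended check with categorical variants / other parts of speech
# (the verb "bomb" for "bombing", the noun "explosion" for "blowing up").
print(related_via_wordnet("bomb", wn.VERB, "blow_up", wn.VERB))       # typically False
print(related_via_wordnet("bombing", wn.NOUN, "explosion", wn.NOUN))  # typically False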
Even when the search is extended to finding synonyms and hypernyms for their categorical variants and/or using other parts of speech (verb for &quot;bombing&quot; and noun phrase for &quot;blowing up&quot;), a match still cannot be found.</Paragraph> <Paragraph position="2"> To include paraphrase matching in summary evaluation, a collection of less-strict paraphrases must be created and a matching strategy needs to be investigated.</Paragraph> </Section> </Section> <Section position="5" start_page="448" end_page="449" type="metho"> <SectionTitle> 4 Paraphrase Acquisition </SectionTitle> <Paragraph position="0"> Paraphrases are alternative verbalizations that convey the same information and are required by many Natural Language Processing (NLP) applications. In particular, summary creation and evaluation methods need to recognize paraphrases and their semantic equivalence. Unfortunately, previous findings in paraphrase identification and extraction (Barzilay and McKeown, 2001; Pang et al., 2003; Bannard and Callison-Burch, 2005) have yet to be incorporated into the evaluation framework.</Paragraph> <Section position="1" start_page="448" end_page="448" type="sub_section"> <SectionTitle> 4.1 Related Work on Paraphrasing </SectionTitle> <Paragraph position="0"> Three major approaches to paraphrase collection are manual collection (domain-specific), collection utilizing existing lexical resources (e.g., WordNet), and derivation from corpora. Hermjakob et al. (2002) view paraphrase recognition as reformulation by pattern recognition. Pang et al. (2003) use word lattices as paraphrase representations built from sets of semantically equivalent translations. Using parallel corpora, Barzilay and McKeown (2001) identify paraphrases from multiple translations of classical novels, whereas Bannard and Callison-Burch (2005) develop a probabilistic representation for paraphrases extracted from large Machine Translation (MT) data sets.</Paragraph> </Section> <Section position="2" start_page="448" end_page="449" type="sub_section"> <SectionTitle> 4.2 Extracting Paraphrases </SectionTitle> <Paragraph position="0"> Our method for automatically constructing a large domain-independent paraphrase collection is based on the assumption that two different English phrases with the same meaning may have the same translation in a foreign language.</Paragraph> <Paragraph position="1"> Statistical Machine Translation (SMT) systems analyze large quantities of bilingual parallel texts in order to learn translational alignments between pairs of words and phrases in two languages (Och and Ney, 2004). The sentence-based translation model makes word/phrase alignment decisions probabilistically by computing the optimal model parameters through statistical estimation. This alignment process results in a corpus of word/phrase-aligned parallel sentences from which we can extract phrase pairs that are translations of each other. We ran the alignment algorithm from (Och and Ney, 2003) on a Chinese-English parallel corpus of 218 million English words. Phrase pairs are extracted following the method described in (Och and Ney, 2004), where all contiguous phrase pairs with consistent alignments are extraction candidates.</Paragraph> <Paragraph position="2"> The resulting phrase table is of high quality; both the alignment models and the phrase extraction methods have been shown to produce very good results for SMT. 
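The grouping step described next (joining English phrases that share a Chinese translation) can be sketched roughly as follows; the miniature phrase table is invented for illustration, with opaque identifiers standing in for the actual Chinese phrases, and is not taken from our data.

from collections import defaultdict

# Invented miniature phrase table: (English phrase, Chinese translation) pairs,
# with placeholders C1, C2, ... standing in for the Chinese side.
phrase_pairs = [
    ("blowing up", "C1"),
    ("bombing", "C1"),
    ("the explosion of", "C1"),
    ("imposed sanctions on", "C2"),
    ("voted sanctions against", "C2"),
    ("working group", "C3"),
]

# Join together all English phrases sharing the same Chinese translation;
# every group with two or more members becomes a paraphrase set.
by_translation = defaultdict(set)
for english, chinese in phrase_pairs:
    by_translation[chinese].add(english)

paraphrase_sets = [phrases for phrases in by_translation.values() if len(phrases) > 1]
print(paraphrase_sets)
# e.g. [{'blowing up', 'bombing', 'the explosion of'},
#       {'imposed sanctions on', 'voted sanctions against'}]

In the real pipeline these pairs come out of the alignment and phrase extraction steps above, and the translation probabilities available in the phrase table could be retained as well; the sketch keeps only the grouping logic.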
Using these pairs we build paraphrase sets by joining together all English phrases that have the same Chinese translation. Figure 2 shows an example word/phrase alignment for two parallel sentence pairs from our corpus where the phrases &quot;blowing up&quot; and &quot;bombing&quot; have the same Chinese translation. On the right side of the figure we show the paraphrase set that contains these two phrases; such sets are typical in our collection of extracted paraphrases.</Paragraph> </Section> </Section> <Section position="6" start_page="449" end_page="451" type="metho"> <SectionTitle> 5 Summary Comparison in ParaEval </SectionTitle> <Paragraph position="0"> This section describes the process of comparing a peer summary against a reference summary and the summary grading mechanism.</Paragraph> <Section position="1" start_page="449" end_page="449" type="sub_section"> <SectionTitle> 5.1 Description </SectionTitle> <Paragraph position="0"> We adopt a three-tier matching strategy for summary comparison. The score received by a peer summary is the ratio of the number of reference words matched to the total number of words in the reference summary. The total number of matched reference words is the sum of the reference words matched across all three tiers. At the top level, favoring high recall coverage, we perform an optimal search to find multi-word paraphrase matches between phrases in the reference summary and those in the peer. Then a greedy search is performed to find single-word paraphrase/synonym matches among the remaining text. Operations conducted in these two top levels are marked as linked rounded rectangles in Figure 3. At the bottom level, we find lexical identity matches, as marked in rectangles in the example. If no paraphrases are found, this last level provides a guarantee of lexical comparison that is equivalent to what other automated systems give. In our system, the bottom level currently performs unigram matching. Thus, we are ensured at least a ROUGE-1 type of summary comparison. Alternatively, equivalents of other ROUGE configurations can replace the ROUGE-1 implementation. There is no theoretical reason why the first two levels should not be merged, but the separation is needed because of the high computational cost of modeling an optimal search. We explain this in detail below.</Paragraph> </Section> <Section position="2" start_page="449" end_page="449" type="sub_section"> <SectionTitle> 5.2 Multi-Word Paraphrase Matching </SectionTitle> <Paragraph position="0"> In this section we describe the algorithm that performs multi-word paraphrase matching between phrases from the reference and peer summaries.</Paragraph> <Paragraph position="1"> Using the example in Figure 3, this algorithm creates the phrases shown in the rounded rectangles and establishes the appropriate links indicating corresponding paraphrase matches.</Paragraph> </Section> <Section position="3" start_page="449" end_page="451" type="sub_section"> <SectionTitle> Problem Description </SectionTitle> <Paragraph position="0"> Measuring the content coverage of a peer summary using a single reference summary requires computing a recall score of how much information from the reference summary is included in the peer. A summary unit, whether from the reference or the peer, cannot be matched more than once. For example, the phrase &quot;imposed sanctions on Libya&quot; in Figure 3's reference summary was matched with the peer summary's &quot;voted sanctions against ...&quot; and cannot be counted twice. 
Conversely, double counting is not permissible for phrases/words in the peer summary, either. We conceptualize the comparison of the peer against the reference as a task to be completed over several time intervals. If the reference summary contains n sentences, there will be n time intervals, where at time t_i the phrases from a particular sentence i of the reference summary are considered against all possible phrases from the peer summary for paraphrase matches. A decision needs to be made at each time interval: * Do we employ a local greedy match algorithm that is recall-generous (preferring more matched words from the reference) towards only the reference sentence currently being analyzed? * Or do we explore globally, inspecting all reference sentences to find the best overall matching combination? Consider the scenario in Figure 4, where a greedy search algorithm would have returned no match. Clearly, the global search algorithm achieves higher overall recall (in words). The matching of paraphrases between a reference and its peer thus becomes a global optimization problem, maximizing the content coverage of the peer when compared against the reference. Solution Model: We use dynamic programming to find the best paraphrase-matching combinations. The optimization problem is as follows: sentences from a reference summary and a peer summary can be broken into phrases of various lengths, and a paraphrase lookup table is used to determine whether a reference phrase and a peer phrase are paraphrases of each other. What is the optimal combination of paraphrase matches between phrases from the reference and the peer that gives the highest recall score (in number of matched reference words) for this given peer? The solution should be recall-oriented (favoring a peer phrase that matches more reference words over one that matches fewer).</Paragraph> <Paragraph position="5"> Following (Trick, 1997), the solution can be characterized as follows: 1) This problem can be divided into n stages corresponding to the n sentences of the reference summary. At each stage, a decision is required to determine the best combination of matched paraphrases between the reference sentence and the entire peer summary that results in no double counting of phrases on the peer side. There is no double counting of reference phrases across stages, since we process one reference sentence at a time and find the best paraphrase matches against the entire peer summary. As long as there is no double counting in the peer, we are guaranteed to have none in the reference, either.</Paragraph> <Paragraph position="6"> 2) At each stage, we define a number of possible states as follows. 
If, out of all possible phrases of any length extracted from the reference sentence, m phrases were found to have matching paraphrases in the peer summary, then a state is any subset of the m phrases.</Paragraph> <Paragraph position="7"> 3) Since no double counting of matched phrases/words is allowed in either the reference summary or the peer summary, the decision of which phrases (leftover text segments in the reference and in the peer) are allowed to match for the next stage is made in the current stage. In the notation used here, P_j and r_i represent phrases chosen for paraphrase matching from the peer and the reference, respectively; a match between P_j and r_i indicates that the phrase P_j from the peer is found to be a paraphrase of the phrase r_i, and the pair is matched only when they are found to be paraphrases of each other. The recall credit contributed by P_j and by r_i may not be equal if the number of words in P_j does not equal the number of words in r_i.</Paragraph> <Paragraph position="8"> 4) Principle of optimality: at a given state, it is not necessary to know what matches occurred at previous stages, only the accumulated recall score (matched reference words) from previous stages and which text segments (phrases) in the peer have not yet been taken/matched in previous stages.</Paragraph> <Paragraph position="9"> 5) There exists a recursive relationship that identifies the optimal decision for stage s (out of n total stages), given that stage s+1 has already been solved.</Paragraph> <Paragraph position="10"> 6) The final stage, n (the last sentence in the reference), is solved by choosing the state that has the highest accumulated recall score while resulting in no double counting of any phrase/word in the peer summary.</Paragraph> <Paragraph position="11"> Figure 5 demonstrates the optimal solution (12 reference words matched) for the example shown in Figure 4. We can express the calculations recursively as f_y(x_b) = r(x_b) + max_x' f_y'(x'), where y' is the adjacent, already-solved stage and the maximum ranges over its states x' whose matched peer phrases do not overlap those used in x_b. Here f_y(x_b) denotes the optimal recall coverage (the number of words in the reference summary matched by phrases from the peer summary) at state x_b in stage y, and r(x_b) is the recall coverage given state x_b alone.</Paragraph> </Section> <Section position="4" start_page="451" end_page="451" type="sub_section"> <SectionTitle> 5.3 Synonym Matching </SectionTitle> <Paragraph position="0"> All paraphrases whose pairings do not involve multi-word to multi-word matching are called synonyms in our experiment. Since these phrases have either an n-to-1 or 1-to-n matching ratio (such as the phrases &quot;blowing up&quot; and &quot;bombing&quot;), a greedy algorithm favoring higher recall coverage reduces the state creation and stage comparison costs associated with the optimal procedure (where state creation is exponential in the number of matched phrases, since a state is any subset of them, and two stages must be compared at any time). The paraphrase table described in Section 4 is used.</Paragraph> <Paragraph position="1"> Synonym matching is performed only on parts of the reference and peer summaries that were not matched during the multi-word paraphrase-matching phase.</Paragraph> </Section> <Section position="5" start_page="451" end_page="451" type="sub_section"> <SectionTitle> 5.4 Lexical Matching </SectionTitle> <Paragraph position="0"> This matching phase performs straightforward lexical matching, as exemplified by the text fragments marked in rectangles in Figure 3. 
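As a rough sketch of this bottom tier (the function, its lower-casing, and the toy word lists are assumptions of the illustration; the unigram counting it performs is the one described in the following sentences):

from collections import Counter

def unigram_match_count(reference_leftover, peer_leftover):
    """ROUGE-1 style clipped unigram matching over the words that the
    multi-word paraphrase and synonym tiers did not already consume.
    Returns the number of matched reference words; dividing by the total
    number of words in the reference summary yields the recall score."""
    ref_counts = Counter(w.lower() for w in reference_leftover)
    peer_counts = Counter(w.lower() for w in peer_leftover)
    # Clipping enforces no double counting on either side: a reference word is
    # credited at most as many times as it appears in the peer leftovers.
    return sum(min(count, peer_counts[word]) for word, count in ref_counts.items())

# Toy leftovers after the two paraphrase tiers (invented for illustration).
print(unigram_match_count(["Libya", "refused", "to", "comply"],
                          ["Libya", "did", "not", "comply"]))  # -> 2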
Unigrams are used as the units for counting matches, in accordance with the previous two matching phases.</Paragraph> <Paragraph position="1"> During all three matching phases, we employ a ROUGE-1 style of counting. Alternatives such as ROUGE-2, ROUGE-SU4, etc., can easily be adapted to each phase.</Paragraph> </Section> </Section> </Paper>