<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1643">
  <Title>A Skip-Chain Conditional Random Field for Ranking Meeting Utterances by Importance</Title>
  <Section position="9" start_page="368" end_page="369" type="evalu">
    <SectionTitle>
7 Evaluation
</SectionTitle>
    <Paragraph position="0"> Evaluating summarization is a difficult problem and there is no broad consensus on how to best perform this task. Two metrics have become quite popular in multi-document summarization, namely the Pyramid method (Nenkova and Passonneau, 2004b) and ROUGE (Lin, 2004). Pyramid and ROUGE are techniques looking for content units repeated in different model summaries, i.e.,summarycontentunits(SCUs)suchasclauses and noun phrases for the Pyramid method, and n-grams for ROUGE. The underlying hypothesis is that different model sentences, clauses, or phrases may convey the same meaning, which is a reasonableassumptionwhendealingwithreferencesum- null maries produced by different authors, since it is quite unlikely that any two abstractors would use the exact same words to convey the same idea.</Paragraph>
    <Paragraph position="1"> Our situation is however quite different, since all model summaries of a given document are utterance extracts of that same document, as this can been seen in the excerpt of Figure 2. In our own annotation of three meetings with SCUs defined as in (Nenkova and Passonneau, 2004a), we found that repetitions and reformulation of the same information are particularly infrequent, and that textual units that express the same content among model summaries are generally originating from the same document sentence (e.g., in the figure, the first sentence in model 1 and 2 emanate from the same document sentence). Very short SCUs (e.g., base noun phrases) sometimes appeared in different locations of a meeting, but we think it is problematic to assume that connections between such short units are indicative of any similarity of sentential meaning: the contexts are different, and words may be uttered by different speakers, which may lead to unrelated or conflicting pragmatic forces. For instance, an SCU realized as &amp;quot;DC offset&amp;quot; and &amp;quot;DC component&amp;quot; appears in two different sentences in the figure, i.e. those identified as 1-13 and 31-41. However, the two sentences have contradictory meanings, and it would be unfortunate to increase the score of a peer summary containing the former sentence because the  latter is included in some model summaries.</Paragraph>
    <Paragraph position="2"> For all these reasons, we believe that summarization evaluation in our case should rely on the following restrictive matching: two summary units should be considered equivalent if and only if they are extracted from the same location in the original document (e.g., the &amp;quot;DC&amp;quot; appearing in models 1 and 2 is not the same as the &amp;quot;DC&amp;quot; in the peer summary, since they are extracted from different sentences). This constraint on the matching is reflected in our Pyramid evaluation, and we define an SCU as a word and its document position, which lets us distinguish (&amp;quot;DC&amp;quot;,11) from (&amp;quot;DC&amp;quot;,33). While this restriction on SCUs forces us to disregard scarcely occurring paraphrases and repetitions of the same information, it provides the benefit of automated evaluation.</Paragraph>
    <Paragraph position="3"> Once all SCUs have been identified, the Pyramid method is applied as in (Nenkova and Passonneau, 2004b): wecomputeascoreD byaddingfor each SCU present in the summary a score equal to the number of model summaries in which that SCU appears. The Pyramid score P is computed by dividing D by the maximum D[?] value that is obtainable given the constraint on length. For instance, the peer summary in the figure gets a score D = 9 (since the 9 SCUs in range 43-51 occur in one model), and the maximum obtainable score is D[?] = 44 (all SCUs of the optimal summary appear in exactly two model summaries), hence the peer summary's score is P = .204.</Paragraph>
    <Paragraph position="4"> While our evaluation scheme is similar to comparing the binary predictions of model and peer summaries--each prediction determining whether a given transcription word is included or not-andaveragingprecisionscoresoverallpeer-model null pairs, the Pyramid evaluation differs on an important point, which makes us prefer the Pyramid evaluation method: the maximum possible Pyramid score is always guaranteed to be 1, but average precision scores can become arbitrarily low as the consensus between summary annotators decreases. For instance, the average precision score of the optimal summary in the figure is PR = 23.2 2Precision scores of the optimal summary compared against the the three model summaries are .5, 1, and .5, respectively, and hence average 23. We can show that P = PR/PR[?], where PR[?] is the average precision of the optimal summary. Lack of space prevent us from providing a proof, so we will just show that the equality holds in our example: since the peer summary's precision scores against the three model summaries are respectively 922, 0, and 0, we have  In the case of the six test meetings, which all have either 3 or 4 model summaries, the maximum possible average precision is .6405.</Paragraph>
  </Section>
  <Section position="10" start_page="369" end_page="371" type="evalu">
    <SectionTitle>
8 Experiments
</SectionTitle>
    <Paragraph position="0"> We follow (Murray et al., 2005) in using the same six meetings as test data, since each of these meetings has multiple reference summaries. The remaining69meetingswereusedfortraining,which null represent in total more than 103,000 training instances (or DA units), of which 6,464 are positives (6.24%). The multi-reference test set contains more than 28,000 instances.</Paragraph>
    <Paragraph position="1"> The goal of a preliminary experiment was to devise a set of useful predictors from a full set of 1171. We performed feature selection by incrementally growing a log-linear model with order-0 features f(x,yt) using a forward feature selection procedure similar to (Berger et al., 1996).</Paragraph>
    <Paragraph position="2"> Probably due to the imbalance between positive and negative samples, we found it more effective to rank candidate features by gains in F-measure  (through5-foldcrossvalidationontheentiretrainingset). TheincreaseinF1 byaddingnewfeatures to the model is displayed in Table 4; this greedy search resulted in a set S of 217 features.</Paragraph>
    <Paragraph position="3"> We now analyze the performance of different sequence models on our test set. The target length of each summary was set to 12.7% of the number of words of the full document, which is the aver- null age on the entire training data (the average on the test data is 12.9%). In Table 5, we use an order-0 CRF to compare S against all features and various categorical groupings. Overall, we notice lexical predictors and statistics derived from them (e.g.</Paragraph>
    <Paragraph position="4"> LSA features) represent the most helpful feature group (.497), though all other features combined achieve a competitive performance (.476).</Paragraph>
    <Paragraph position="5"> Table 6 displays performance for sequence models incorporating linear-chain features of increasing order k. Its second column indicates what criterion was used to rank utterances. In the case of 'pred', we used actual model {[?]1,1} predictions, which in all cases generated summaries much shorted than the allowable length, and produced poor performance. 'Costs' and 'norm-CRF' refer to the two ranking criteria presented in Section 5, and it is clear that the performance of CRFs degrades with increasing orders without local normalization. While the contingency counts in Table 2 only hinted a limited benefit of linear-chain features, empirical results show the contrary-especially for order k = 2. However, the further increase of k causes overfitting, and skip-chain features seem a better way to capture non-local dependencies while keeping the number of model parameters relatively small. Overall, the addition of skip-chain edges to linear-chain models provide noticeable improvement in Pyramid scores. Our system that performed best on cross-validation data is an order-2 CRF with skip-chain transitions, which achieves a Pyramid score of P = .554.</Paragraph>
    <Paragraph position="6"> We now assess the significance of our results by comparing our best system against: (1) a lead summarizer that always selects the first N utterances to match the predefined length; (2) human performance, which is obtained by leave-one-out comparisons among references (Table 7); (3) &amp;quot;optimal&amp;quot; summaries generated using the procedure explained in (Nenkova and Passonneau, 2004b) by ranking document utterances by the number of model summaries in which they appear. It appears that our system is considerably better than the baseline, and achieves 91.3% of human performance in terms of Pyramid scores, and 83% if using ASR transcription. This last result is particularly positive if we consider our strong reliance on lexical features.</Paragraph>
    <Paragraph position="7"> For completeness, we also included standard ROUGE (1, 2, and L) scores in Table 7, which were obtained using parameters defined for the  k stands for the order of linear-chain features. The value in bold is the performance of the model that was selected after a 5-fold cross validation on the training data, which obtained  produces by a baseline (lead summarizer), our best system, humans, and the optimal summarizer.</Paragraph>
    <Paragraph position="8"> DUC-05 evaluation. Since system summaries have on average approximately the same length as references, we only report recall measures of ROUGE (precision and F averages are within +.002).3 It may come as a surprise that our best system (both with ASR and true words) performs almost as well as humans; it seems more reasonable to conclude that, in our case, ROUGE has trouble discriminating between systems with moderately close performance. This seems to confirm our impression that content evaluation in our task should be based on exact matches.</Paragraph>
    <Paragraph position="9"> We performed a last experiment to compare our bestsystemagainstMurrayetal.(2005), whoused the same test data, but constrained summary sizes in terms of number of DA units instead of words.</Paragraph>
    <Paragraph position="10"> In their experiments, 10% of DAs had to be selected. Our system achieves .91 recall, .5 precision, and .64 F1 with the same length constraint. 3Human performance with ROUGE was assessed by cross-validating reference summaries of each meeting (i.e., n references for a given meeting resulted in n evaluations against the other references). We used the same leave-one-out procedure with other summarizers, in order to get results comparable to humans.</Paragraph>
    <Paragraph position="11">  The discrepancy between recall and precision is largely due to the fact that generated summaries areonaveragemuchlongerthanmodelsummaries (10% vs. 6.26% of DAs), which explains why our precision is relatively low in this last evaluation.</Paragraph>
    <Paragraph position="12"> The best ROUGE-1 measure reported in (Murray et al., 2005) is .69 recall, which is significantly lower than ours according to confidence intervals.</Paragraph>
  </Section>
class="xml-element"></Paper>