<?xml version="1.0" standalone="yes"?> <Paper uid="H05-1031"> <Title>Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 241-248, Vancouver, October 2005. ©2005 Association for Computational Linguistics. Automatically Learning Cognitive Status for Multi-Document Summarization of Newswire</Title>
<Section position="3" start_page="243" end_page="246" type="metho"> <SectionTitle> 4 Machine learning experiments </SectionTitle>
<Paragraph position="0"> The distinction between hearer-old and hearer-new entities depends on the readers. In other words, we are attempting to automatically infer which characters would be hearer-old for the intended readership of the original reports, which is also expected to be the intended readership of the summaries. For our experiments, we used the WEKA (Witten and Frank, 2005) machine learning toolkit and obtained the best results for hearer-old/new using a support vector machine (the SMO algorithm) and, for major/minor, a tree-based classifier (J48). We used WEKA's default settings for both algorithms.</Paragraph>
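As a concrete illustration of the setup just described, the sketch below recreates the two classification experiments in Python with scikit-learn rather than WEKA (this is not the authors' code). SVC stands in for the SMO support vector machine and DecisionTreeClassifier for J48, both with default settings; the feature matrix and the two label vectors are hypothetical placeholders for the Table 1 features and the hearer-old/new and major/minor annotations, and the 20-fold cross-validation mirrors the evaluation reported later in this section.

```python
# Analogous sketch of the two classification setups (not the authors' WEKA code).
# X is a hypothetical (n_people x 19) matrix of the Table 1 features;
# y_hearer_old and y_major are hypothetical 0/1 label vectors.
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.random((258, 19))                # placeholder feature matrix
y_hearer_old = rng.integers(0, 2, 258)   # placeholder hearer-old/new labels
y_major = rng.integers(0, 2, 258)        # placeholder major/minor labels

cv = StratifiedKFold(n_splits=20, shuffle=True, random_state=0)

# Hearer-old/new: an SVM, standing in for WEKA's SMO (default settings).
svm_scores = cross_val_score(SVC(), X, y_hearer_old, cv=cv)

# Major/minor: a decision tree, standing in for WEKA's J48 (default settings).
tree_scores = cross_val_score(DecisionTreeClassifier(), X, y_major, cv=cv)

print(f"hearer-old/new accuracy: {svm_scores.mean():.2f}")
print(f"major/minor accuracy:    {tree_scores.mean():.2f}")
```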
<Paragraph position="1"> We now discuss the features we used for our two classification tasks (cf. the list of features in Table 1). Our hypothesis is that features capturing the frequency and the syntactic and lexical forms of references are sufficient to infer the desired cognitive model.</Paragraph>
<Paragraph position="2"> Intuitively, pronominalization indicates that an entity was particularly salient at a specific point of the discourse, as has been widely discussed in the attentional status and centering literature (Grosz and Sidner, 1986; Gordon et al., 1993). Modified noun phrases (with apposition, relative clauses or premodification) can also signal different status.</Paragraph>
<Paragraph position="3"> In addition to the syntactic form features, we used two months' worth of news articles collected over the web (and independent of the DUC collection we use in our experiments here) to build unigram and bigram lexical models of first mentions of people. The names themselves were removed from the first-mention noun phrase and the counts were collected over the premodifiers only. One of the lexical features we used is whether a person's description contains any of the 20 most frequent description words from our web corpus. We reasoned that these frequent descriptors may signal importance; the full list is: president, former, spokesman, sen, dr, chief, coach, attorney, minister, director, gov, rep, leader, secretary, rev, judge, US, general, manager, chairman.</Paragraph>
Table 1: List of features used for the two classification tasks
0,1: Number of references to the person, including pronouns (total and normalized by feature 16)
2,3: Number of times apposition was used to describe the person (total and normalized by feature 16)
4,5: Number of times a relative clause was used to describe the person (total and normalized by feature 16)
6: Number of times the entity was referred to by name after the first reference
7,8: Number of copula constructions involving the person (total and normalized by feature 16)
9,10: Number of apposition, relative clause or copula descriptions (total and normalized by feature 16)
11,12,13: Probability of an initial reference according to the bigram model (average, maximum and minimum over all initial references)
14: Number of the top 20 high-frequency description words (from references to people in a large news corpus) present in initial references
15: Proportion of first references containing the full name
16: Total number of documents containing the person
17,18: Number of appositives or relative clauses attaching to initial references (total and normalized by feature 16)
<Paragraph position="4"> Another lexical feature was the overall likelihood of a person's description according to the bigram model from our web corpus. This indicates whether a person has a role or affiliation that is frequently mentioned. We performed 20-fold cross-validation for both classification tasks. The results are shown in Table 2 (accuracy) and Table 3 (precision/recall).</Paragraph>
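To make these lexical features more concrete, the following sketch (with a tiny hypothetical premodifier corpus rather than the two months of web news used in the paper) shows one way smoothed bigram probabilities of first-mention premodifiers (features 11-13) and the count of top-20 descriptor words (feature 14) could be computed.

```python
# Sketch of the premodifier bigram model and high-frequency-descriptor feature
# (features 11-14 in Table 1). The real counts come from a large web news corpus;
# here a tiny hypothetical list of first-mention premodifier strings is used.
from collections import Counter

premods = [                      # premodifiers of first mentions, names removed
    "former president", "chief executive", "former senator",
    "spokesman", "attorney general", "former president",
]

TOP_DESCRIPTORS = {"president", "former", "spokesman", "sen", "dr", "chief",
                   "coach", "attorney", "minister", "director", "gov", "rep",
                   "leader", "secretary", "rev", "judge", "us", "general",
                   "manager", "chairman"}

unigrams, bigrams = Counter(), Counter()
for phrase in premods:
    toks = ["<s>"] + phrase.lower().split()
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

def bigram_prob(phrase, alpha=1.0):
    """Add-alpha smoothed bigram probability of a first-mention premodifier."""
    toks = ["<s>"] + phrase.lower().split()
    vocab = len(unigrams)
    prob = 1.0
    for prev, cur in zip(toks, toks[1:]):
        prob *= (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab)
    return prob

def high_freq_count(phrase):
    """Feature 14: number of top-20 descriptor words in an initial reference."""
    return sum(1 for tok in phrase.lower().split() if tok in TOP_DESCRIPTORS)

print(bigram_prob("former president"), high_freq_count("former president"))
```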
<Section position="1" start_page="244" end_page="245" type="sub_section"> <SectionTitle> 4.1 Major vs Minor results </SectionTitle>
<Paragraph position="0"> For major/minor classification, the majority class prediction has 94% accuracy, but it is not a useful baseline, as it predicts that no person should be mentioned by name and that all are minor characters. J48 correctly predicts 114 major characters out of the 258 in the 170 document sets. As recall appeared low, we further analyzed the 148 persons from the DUC'03 and DUC'04 sets, for which DUC provides four human summaries. Table 4 presents the distribution of recall taking into account how many humans mentioned the person by name in their summary (originally, entities were labeled as main if any summary had a reference to them, cf. Section 3.2). It can be seen that recall is high (0.84) when all four humans consider a character to be major, and falls to 0.2 when only one out of four humans does. These observations reflect the well-known fact that humans differ in their choices for content selection, and indicate that automatic learning is more successful when there is more human agreement.</Paragraph>
<Paragraph position="1"> In our data there were 258 people mentioned by name in at least one human summary. In addition, there were 103 people who were mentioned in at least one human summary using only a common noun reference (these were identified by hand, as common noun coreference cannot be performed reliably enough by automatic means), indicating that 29% of the people mentioned in human summaries are not actually named. Examples of such references include an off duty black policeman, a Nigerian born Roman catholic priest, and Kuwait's US ambassador. For the purpose of generating references in a summary, it is important to evaluate how many of these people are correctly classified as minor characters. We removed these people from the training data and kept them as a test set. WEKA achieved a testing accuracy of 74% on these 103 test examples. But as discussed before, different human summarizers sometimes made different decisions on the form of reference to use. Out of the 103 referents for which a non-named reference was used by some summarizer, there were 40 for which other summarizers used a named reference. Only 22 of these 40 were labeled as minor characters by our automatic procedure. Out of the 63 people who were not named in any summary, but were mentioned in at least one by a common noun reference, WEKA correctly predicted 58 (92%) as minor characters. As before, we observe that when human summarizers generate references of the same form (reflecting consensus on conveying the perceived importance of the character), the machine predictions are accurate.</Paragraph>
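The held-out evaluation just described can be sketched as follows; the arrays are hypothetical placeholders and DecisionTreeClassifier again stands in for WEKA's J48. The point is simply that the 103 people referred to only by a common noun in some human summary are excluded from training and scored separately.

```python
# Sketch of the held-out test described above: people referred to only by a
# common-noun reference in some human summary are removed from training and
# used purely as a test set (hypothetical arrays, not the authors' data).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n_people = 500                          # hypothetical pool of people
X = rng.random((n_people, 19))          # Table 1 features (placeholder)
y = rng.integers(0, 2, n_people)        # 1 = major, 0 = minor (placeholder)

held_out = np.zeros(n_people, dtype=bool)
held_out[-103:] = True                  # stand-in for the 103 common-noun-only referents

clf = DecisionTreeClassifier()          # J48 analogue, default settings
clf.fit(X[~held_out], y[~held_out])     # train only on the remaining people

test_acc = clf.score(X[held_out], y[held_out])
print(f"accuracy on the held-out common-noun referents: {test_acc:.2f}")
```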
<Paragraph position="2"> We performed feature selection to identify the most important features for the classification task. For the major/minor classification, the important features used by the classifier were the number of documents the person was mentioned in (feature 16), the number of mentions within the document set (features 1,6), the number of relative clause (features 4,5) and copula (feature 8) constructs, the total number of apposition, relative clause and copula descriptions (feature 9), the number of high-frequency premodifiers (feature 14) and the maximum bigram probability (feature 12). It was interesting that the presence of apposition did not select for either the major or the minor class. It is not surprising that the frequencies of mention within and across documents were significant features: a frequently mentioned entity will naturally be considered important for the news report. Interestingly, the syntactic form of the references was also a significant indicator, suggesting that journalists signal the centrality of a character by using specific syntactic constructs in the references.</Paragraph>
<Paragraph position="3"> Table 4 (column headings): Number of summaries containing the person; Number of examples; Number and % recalled by J48.</Paragraph>
</Section>
<Section position="2" start_page="245" end_page="246" type="sub_section"> <SectionTitle> 4.2 Hearer Old vs New Results </SectionTitle>
<Paragraph position="0"> The majority class prediction for the hearer-old/new classification task is that no one is known to the reader; it leads to an overall classification accuracy of 54%. Using this prediction in a summarizer would result in excessive detail in referring expressions and a consequent reduction in the space available to summarize the news events. The SMO prediction outperformed this baseline by 22 percentage points and is more meaningful for real tasks.</Paragraph>
<Paragraph position="1"> For the hearer-old/new classification, the feature selection step chose the following features: the number of appositions (features 2,3) and relative clauses (feature 5), the number of mentions within the document set (features 0,1), the total number of apposition, relative clause and copula descriptions (feature 10), the number of high-frequency premodifiers (feature 14) and the minimum bigram probability (feature 13). As in the major/minor classification, the syntactic choices for reference realization were useful features.</Paragraph>
<Paragraph position="2"> We conducted an additional experiment to see how hearer-old/new status affects the use of apposition or relative clauses for elaboration in references produced in human summaries. It has been observed (Siddharthan et al., 2004) that on average these constructs occur 2.3 times less frequently in human summaries than in machine summaries. As we show, the use of postmodification to elaborate relates to the hearer-old/new distinction.</Paragraph>
<Paragraph position="3"> To determine when an appositive or relative clause can be used to modify a reference, we considered the 151 examples out of 258 where there was at least one relative clause or apposition describing the person in the input. We labeled an example as positive if at least one human summary contained an apposition or relative clause for that person, and negative otherwise. There were 66 positive and 85 negative examples. This data set was interesting because, while for the majority of examples (56%) all the human summarizers agreed not to use postmodification, there were very few examples (under 5%) where all the humans agreed to postmodify. Thus it appears that for around half the cases it should be obvious that no postmodification is required, but for the other half, human decisions go either way.</Paragraph>
<Paragraph position="4"> Notably, none of the hearer-old persons (using the test predictions of SMO) were postmodified. Our cognitive status predictions thus cleanly partition the examples into those where postmodification is not required and those where it might be. Since no intuitive rule handled the remaining examples, we added the testing predictions of hearer-old/new and major/minor as features to the list in Table 1 and tried to learn this task using the tree-based learner J48. We report a testing accuracy of 71.5% (the majority class baseline is 56%). There were only three useful features: the predicted hearer-new/old status, the number of high-frequency premodifiers for that person in the input (feature 14 in Table 1) and the average number of postmodified initial references in the input documents (feature 17).</Paragraph>
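A minimal sketch of this last experiment, under the assumption that the test-time predictions of the two cognitive-status classifiers are already available as 0/1 vectors (all arrays below are hypothetical placeholders): the predictions are appended to the Table 1 features and a decision tree, standing in for J48, is trained to predict whether any human summary postmodifies the reference.

```python
# Sketch of the postmodification experiment described above: predicted
# hearer-old/new and major/minor labels are appended to the Table 1 features,
# and a decision tree (J48 analogue) predicts whether a human summarizer would
# use an apposition or relative clause. Arrays are hypothetical placeholders.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.random((151, 19))                  # Table 1 features for 151 examples
pred_hearer_old = rng.integers(0, 2, 151)  # predicted hearer-old/new status
pred_major = rng.integers(0, 2, 151)       # predicted major/minor status
y_postmod = rng.integers(0, 2, 151)        # 1 = some human summary postmodified

# Stack the two predicted statuses onto the original feature matrix.
X_aug = np.column_stack([X, pred_hearer_old, pred_major])

scores = cross_val_score(DecisionTreeClassifier(), X_aug, y_postmod, cv=20)
print(f"postmodification accuracy: {scores.mean():.2f}")
```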
<SectionTitle> 5 Validating the results on current news </SectionTitle>
<Paragraph position="5"> We tested the classifiers on data different from that provided by DUC, and also tested human consensus on the hearer-new/old distinction. For these purposes, we downloaded 45 clusters from one day's output of Newsblaster. We then automatically compiled the list of people mentioned in the machine summaries for these clusters. There were 107 unique people that appeared in the machine summaries, out of 1075 people in the input clusters.</Paragraph>
</Section>
<Section position="3" start_page="246" end_page="246" type="sub_section"> <SectionTitle> 5.1 Human agreement on hearer-old/new </SectionTitle>
<Paragraph position="0"> A question arises when attempting to infer hearer-new/old status: is it meaningful to generalize this across readers, given how dependent it is on the world knowledge of individual readers? To address this question, we gave four American graduate students a list of the names of people in the DUC human summaries (cf. Section 3) and asked them to write down, for each person, their country/state/organization affiliation and their role (writer/president/attorney-general, etc.). We considered a person hearer-old to a subject if the subject correctly identified both the role and the affiliation of that person.</Paragraph>
<Paragraph position="1"> For the 258 people in the DUC summaries, the four subjects demonstrated 87% agreement (agreement was also measured with the kappa statistic, which measures agreement above what might be expected by pure chance: κ = 1 indicates perfect agreement between annotators and κ = 0 agreement only at chance level; see Carletta (1996) for discussion of its use in NLP). Similarly, they were asked to perform the same task for the Newsblaster data, which dealt with contemporary news (the human judgments were made within a week of the news stories appearing), in contrast with the DUC data, which contained news from the late 80s and early 90s. On this data, the human agreement was 91%. This agreement is high enough to suggest that the classification of national and international figures as hearer-old/new for the educated adult American reader with varied interests and background in current and recent events is a well-defined task. This is not necessarily true for the full range of cognitive status distinctions; for example, Poesio and Vieira (1998) report lower human agreement on more fine-grained classifications of definite descriptions.</Paragraph>
</Section>
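The agreement figures above can be computed as in the sketch below. The paper reports a single agreement and kappa figure over four subjects; one plausible formulation, assumed here rather than taken from the paper, is to average raw agreement and Cohen's kappa over all annotator pairs. The judgment matrix is a small hypothetical example.

```python
# Sketch of the inter-annotator agreement computation discussed above.
# Averaging pairwise Cohen's kappa is an assumption about the formulation,
# not necessarily the authors' exact calculation.
from itertools import combinations
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical hearer-old (1) / hearer-new (0) judgments, one row per annotator.
judgments = np.array([
    [1, 1, 0, 1, 0, 0, 1, 1],
    [1, 1, 0, 1, 0, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 1],
    [1, 1, 0, 1, 0, 0, 1, 0],
])

kappas, agreements = [], []
for a, b in combinations(range(len(judgments)), 2):
    kappas.append(cohen_kappa_score(judgments[a], judgments[b]))
    agreements.append(np.mean(judgments[a] == judgments[b]))

print(f"mean pairwise agreement: {np.mean(agreements):.2f}")
print(f"mean pairwise kappa:     {np.mean(kappas):.2f}")
```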
<Section position="4" start_page="246" end_page="246" type="sub_section"> <SectionTitle> 5.2 Results on the Newsblaster data </SectionTitle>
<Paragraph position="0"> We measured how well the models trained on DUC data perform on current news labeled using human judgment. For each person who was mentioned in the automatic summaries for the Newsblaster data, we compiled one judgment from the four human subjects: an example was labeled as hearer-new if two or more of the four subjects had marked it as hearer-new. We then used this data as a test set for the model trained solely on the DUC data.</Paragraph>
<Paragraph position="1"> The classifier for the hearer-old/hearer-new distinction achieved 75% accuracy on the Newsblaster data labeled by humans, while the cross-validation accuracy on the automatically labeled DUC data was 76%. These numbers are very encouraging, since they indicate that the performance of the classifier is stable and does not vary between the DUC and Newsblaster data. The precision and recall for the Newsblaster data are also very similar to those obtained from cross-validation on the DUC data.</Paragraph>
</Section>
<Section position="5" start_page="246" end_page="246" type="sub_section"> <SectionTitle> 5.3 Major/Minor results on Newsblaster data </SectionTitle>
<Paragraph position="0"> For the Newsblaster data, no human summaries were available, so there was no direct indication of whether a human summarizer would mention a person in a summary. In order to evaluate the performance of the classifier, we gave a human annotator the list of people's names appearing in the machine summaries, together with the input cluster and the machine summary, and asked which of the names on the list would be suitable keywords for the set (keyword lists are a form of very short summary).</Paragraph>
<Paragraph position="1"> Out of the 107 names on the list, the annotator chose 42 as suitable descriptive keywords for the set.</Paragraph>
<Paragraph position="2"> The major/minor classifier was run on the 107 examples; only 40 were predicted to be major characters. Of the 67 test cases that were predicted by the classifier to be minor characters, 12 (18%) were marked by the annotator as acceptable keywords. In comparison, of the 40 characters that were predicted to be major characters by the classifier, 30 (75%) were marked as possible keywords. If the keyword selections of the annotator are taken as ground truth, the automatic predictions have precision and recall of 0.75 (30/40) and 0.71 (30/42), respectively, for the major class.</Paragraph>
</Section> </Section> </Paper>