Detecting Shifts in News Stories for Paragraph Extraction

3 SVMs

We use a supervised learning technique, SVMs (Vapnik, 1995), for the tracking and paragraph extraction tasks. SVMs are defined over a vector space in which the problem is to find a decision surface that 'best' separates a set of positive examples from a set of negative examples by introducing the maximum 'margin' between the two sets. Figure 2 illustrates a simple problem that is linearly separable. The solid line denotes the decision surface, and the two dashed lines refer to the boundaries of the margin. The circled examples are called support vectors, and their removal would change the decision surface. Precisely, the decision surface for a linearly separable space is a hyperplane, which can be written as $w \cdot x + b = 0$, where $x$ is an arbitrary data point ($x \in R^n$) and $w$ and $b$ are learned from a training set. In the linearly separable case, maximizing the margin can be expressed as an optimization problem:

$$\text{minimize } \|w\| \ \text{ subject to } \ y_i (w \cdot x_i + b) \geq 1, \quad (1)$$

where $x_i$ is the $i$-th training example and $y_i \in \{+1, -1\}$ is the label corresponding to the $i$-th training example. The solution is written as

$$w = \sum_i \alpha_i y_i x_i. \quad (2)$$

In formula (2), each element $w_k$ of $w$ ($1 \leq k \leq n$) corresponds to a word in the training examples, and the larger the value of $w_k$, the more the word $w_k$ features positive examples.

We use an upper bound $E'_{loo}$ on the leave-one-out error of SVMs to estimate the optimal window size in the training data. $E'_{loo}$ estimates the performance of a classifier. It is based on the idea of the leave-one-out technique: the first example is removed from the $l$ training examples, the remaining examples are used for training, and a classifier is induced. The classifier is then tested on the held-out example. The process is repeated for all training examples. The number of errors divided by $l$, $E_{loo}$, is the leave-one-out estimate of the generalization error. $E'_{loo}$ uses an upper bound on $E_{loo}$ instead of calculating it, which is computationally very expensive. Recall that the removal of a support vector changes the decision surface; thus the worst case happens when every support vector becomes an error. Let $l$ be the number of training examples of a set $S$, and $m$ be the number of support vectors. Then

$$E'_{loo} = \frac{m}{l}. \quad (3)$$

4.1 Tracking by Window Adjustment

Like much previous research, our hypothesis regarding event tracking is that exploiting time will lead to improved data adjustment, because documents closer together in the stream are more likely to discuss related subjects than documents further apart. Let $\vec{x}_1, \dots, \vec{x}_q$ be the negative training documents. The algorithm can be summarized as follows:

1. Scoring negative training documents. In the TDT tracking task, the number of labelled positive training documents is small (at most 16 documents) compared to the number of negative training documents. Therefore, the choice of good training data is an important issue for producing optimal results. We first represent each document as a vector in an $n$-dimensional space, where $n$ is the number of words in the collection. The cosine of the angle between two vectors $\vec{x}_i$ and $\vec{x}_j$ is

$$\cos(\vec{x}_i, \vec{x}_j) = \frac{\sum_{k=1}^{n} x_{ik} x_{jk}}{\sqrt{\sum_{k=1}^{n} x_{ik}^2}\,\sqrt{\sum_{k=1}^{n} x_{jk}^2}}, \quad (4)$$

where $x_{ik}$ and $x_{jk}$ are the term frequencies of word $k$ in the documents $\vec{x}_i$ and $\vec{x}_j$, respectively.
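As a concrete illustration of formula (4), the following is a minimal sketch of computing the cosine between two term-frequency vectors. It is our own illustration, not code from the paper; the vocabulary handling and function names are assumptions.

```python
import math
from collections import Counter

def tf_vector(tokens, vocabulary):
    # Represent a document as a term-frequency vector over the collection vocabulary.
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def cosine(x_i, x_j):
    # Cosine of the angle between two term-frequency vectors (formula (4)).
    dot = sum(a * b for a, b in zip(x_i, x_j))
    norm_i = math.sqrt(sum(a * a for a in x_i))
    norm_j = math.sqrt(sum(b * b for b in x_j))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0
    return dot / (norm_i * norm_j)

# Example: two short documents over a shared vocabulary.
vocab = sorted({"quake", "kobe", "damage", "minister"})
d1 = tf_vector(["kobe", "quake", "quake", "damage"], vocab)
d2 = tf_vector(["kobe", "quake", "minister"], vocab)
print(cosine(d1, d2))  # cosine similarity in [0, 1] for term-frequency vectors
```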
We compute a relevance score for each negative training document as the cosine of the angle between the vector of the center of gravity of the positive training documents and the vector of the negative training document, i.e., $s(\vec{x}_j) = \cos(\vec{g}, \vec{x}_j)$, where $\vec{x}_j$ is the $j$-th negative training document and $\vec{g}$ is defined as follows:

$$\vec{g} = \frac{1}{p} \sum_{i=1}^{p} \vec{x}_i, \quad (5)$$

where $p$ is the number of positive training documents and $\vec{x}_i$ is the $i$-th positive document vector.

2. Adjusting the window size. We need to find the optimal window size so as to include only the positive documents which are sufficiently related to the current subject; the window size is estimated with the bound $E'_{loo}$ described in Section 3. The flow of the algorithm is shown in Figure 3. When the most recent positive document is regarded as discussing a new subject, we use all previously seen positive documents for training as a default strategy.

3. Tracking. Let $num$ be the number of adjusted positive training documents. The top $num$ negative documents are extracted from the $q$ negative documents and merged with the $num$ positive documents. The new set is trained by SVMs, and a classifier is induced. Recall that $E'_{loo}$ is computationally less expensive than $E_{loo}$; however, it is sometimes too tight for a small amount of training data. This causes a high F/A rate, i.e., a high ratio of documents that were judged as positive but were evaluated as negative. We therefore use a simple measure for each test document that is determined to be positive by the classifier: for each training document, we compute the cosine between the test and training document vectors. If the cosine between the test document and a negative training document is the largest, the test document is judged to be negative; otherwise, it is truly positive. Steps 1, 2 and 3 are repeated until the last test document has been judged.
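The window adjustment step can be sketched as follows. This is a minimal illustration of our own, not the authors' code, and the selection criterion — choosing the window of recent positive documents whose induced classifier has the smallest bound $m/l$ from formula (3) — is our reading of how $E'_{loo}$ is used to estimate the window size.

```python
from sklearn.svm import SVC  # any binary SVM trainer would do here

def loo_bound(X, y):
    # Upper bound E'_loo = m / l on the leave-one-out error (formula (3)):
    # m support vectors out of l training examples.
    clf = SVC(kernel="linear").fit(X, y)
    return len(clf.support_) / len(y)

def adjust_window(positives, negatives):
    # Choose how many of the most recent positive documents to keep by
    # minimizing the leave-one-out bound of the classifier induced from
    # the candidate window plus the negative examples.
    # `positives` is a chronologically ordered list of term-frequency vectors.
    best_size, best_bound = len(positives), float("inf")
    for size in range(1, len(positives) + 1):
        window = positives[-size:]  # the `size` most recent positive documents
        X = window + negatives
        y = [1] * len(window) + [-1] * len(negatives)
        bound = loo_bound(X, y)
        if bound < best_bound:
            best_size, best_bound = size, bound
    return best_size
```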
4.2 Paragraph Extraction

Our window adjustment algorithm is applied each time a document discusses the target event. Therefore, some documents are assigned to more than one set of documents. We thus eliminate the sets which completely overlap each other, and apply paragraph extraction to the result. Our hypothesis about key paragraphs is that they include subject, subject-class, and event words. Let $x_p$ be a paragraph in the document $x$, and $x_{\bar{p}}$ be the document $x$ with the paragraph $x_p$ removed. Let also $l$ be the total number of documents in a set where each document discusses subjects related to $x$. If $x_p$ includes subject words, $x_{\bar{p}}$ is still classified into $x$ rather than into the other $l-1$ documents, since subject words appear across the paragraphs of $x$ rather than in the other $l-1$ documents. We apply SVMs to the training data, which consists of the $l$ documents, and induce a classifier $sbj(x_p)$. SVMs perform binary classification, while our paragraph extraction is a multi-class classification problem, i.e., $l$ classes. We therefore use the pairwise technique for applying SVMs to multi-class data (Weston and Watkins, 1998), and assign $x_p$ to one of the $l$ documents. In a similar way, we apply SVMs to two other sets of training data and induce the classifiers $sbj\_class(x_p)$ and $event(x_p)$; for the latter, one training set is a set of documents concerning the target event, and the other is a set of documents which do not concern the target event. We extract paragraphs for which (6) holds:

$$sbj(x_p) = 1 \ \wedge \ sbj\_class(x_p) = 1 \ \wedge \ event(x_p) = 1. \quad (6)$$
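A minimal sketch of the pairwise (one-vs-one) technique used for $sbj(x_p)$: one binary SVM is trained per pair of documents, and a paragraph is assigned to the document that wins the most pairwise votes. This is our own illustration under assumed data structures, not the paper's implementation.

```python
from itertools import combinations
from collections import Counter
from sklearn.svm import SVC

def train_pairwise(paragraphs_by_doc):
    # Train one binary SVM per pair of documents (classes).
    # `paragraphs_by_doc` maps a document id to a list of its paragraph vectors.
    classifiers = {}
    for a, b in combinations(sorted(paragraphs_by_doc), 2):
        X = paragraphs_by_doc[a] + paragraphs_by_doc[b]
        y = [a] * len(paragraphs_by_doc[a]) + [b] * len(paragraphs_by_doc[b])
        classifiers[(a, b)] = SVC(kernel="linear").fit(X, y)
    return classifiers

def assign(classifiers, x_p):
    # Assign paragraph x_p to the document that wins the most pairwise votes.
    votes = Counter(clf.predict([x_p])[0] for clf in classifiers.values())
    return votes.most_common(1)[0][0]
```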
5 Experiments

We used the TDT1 corpus, which comprises documents from two different sources, Reuters (7,965 documents) and CNN (7,898 documents) (Allan et al., 1998a). A set of 25 target events was defined, and each document is labelled according to whether or not it discusses each target event. All 15,863 documents were tagged by a part-of-speech tagger (Brill, 1992) and stemmed using WordNet information (Fellbaum, 1998). We extracted all nouns in the documents.

5.1 Tracking Task

Table 1 summarizes the results, which were obtained using the standard TDT evaluation measures. When $N_t$ takes the value 1, we use the positive training document $d$ and one negative training document $\vec{y}$ for training, where $\vec{y}$ is the vector whose cosine with $d$ is the largest among the negative documents. The test set is always the collection minus the $N_t = 16$ documents. 'Miss' denotes the miss rate, the ratio of documents that were evaluated as Yes but were not judged as Yes by the system. 'F/A' shows the false alarm rate, the ratio of documents that were judged as Yes by the system but were evaluated as No. 'Prec' stands for precision, the ratio of correct assignments by the system to the total number of the system's assignments. 'F1' is a measure that balances recall and precision, where recall denotes the ratio of correct assignments by the system to the total number of correct assignments. Table 1 shows that there is no significant difference among $N_t$ values except for 1, since F1 ranges from 0.78 to 0.79. This shows that the method works well even with a small number of initial positive training documents. Furthermore, the results are comparable to existing event tracking techniques: the F1, Miss and F/A scores reported by CMU were 0.66, 29 and 0.40, and those of UMass were 0.62, 39 and 0.27, respectively, when $N_t$ is 4 (Allan et al., 1998b).
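To make the evaluation measures concrete, here is a small sketch of our own (not the official TDT scoring tool) that computes Miss, F/A, precision, recall, and F1 from parallel lists of system judgements and gold labels:

```python
def tdt_scores(system, actual):
    # `system` and `actual` are parallel lists of booleans;
    # True means Yes, i.e. the document discusses the target event.
    tp = sum(s and a for s, a in zip(system, actual))
    fp = sum(s and not a for s, a in zip(system, actual))
    fn = sum(a and not s for s, a in zip(system, actual))
    tn = sum(not s and not a for s, a in zip(system, actual))
    miss = fn / (tp + fn) if tp + fn else 0.0         # Yes documents the system failed to flag
    false_alarm = fp / (fp + tn) if fp + tn else 0.0  # No documents the system flagged as Yes
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return miss, false_alarm, precision, recall, f1
```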
The contribution of the adaptive window algorithm is best explained by looking at the window sizes it estimates. Table 2 illustrates a sample tracking result for the 'Kobe Japan Quake' event when $N_t$ = 16. This event has many documents, each of which discusses a new subject related to the target event. The table shows the first 10 documents, in chronological order, which were evaluated as positive. Columns 1-3 in Table 2 denote the id number, date, and title of the document, respectively; 'id=1', for example, denotes the first document which was evaluated as positive. Columns 4 and 5 stand for the result of our method and the majority decision of three human judges, respectively. They take on three values: 'Yes' denotes that the document discusses the same subject as an earlier one, and 'New' indicates that the document discusses a new subject. We can see that the method correctly recognizes a test document as discussing an earlier subject or a new one, since the results of our method ('system') and the human judges ('actual') coincide except for id=5, 6 and 7. Columns 6-8 stand for the accuracy of the adjusted window size. Recall denotes the number of documents selected by both the system and the human judges divided by the total number of documents selected by the human judges, and precision denotes the number of documents selected by both the system and the human judges divided by the total number of documents selected by the system.

When the method correctly recognizes a test document as discussing an earlier subject ('system = actual = Yes'), our algorithm selects documents which are sufficiently related to the current subject, since the overall average F1 was 0.82. We note that the precision for 'system = New' is low. This is because of our default strategy: we use all previously seen positive documents for training when the most recent training document is judged to discuss a new subject.

5.2 Paragraph Extraction

We used the 15 out of 25 events which have more than 16 positive documents. Table 3 gives the number of documents and paragraphs in each event: 'Avg.' under 'doc' shows the average number of documents per event, and 'Avg.' under 'para' denotes the average number of paragraphs per document. The maximum number of paragraphs per document was 100.

Table 4 shows the results of paragraph extraction. 'CNN' refers to the results using the CNN corpus as both training and test data, 'Reuters' denotes the results using the Reuters corpus, and 'Total' stands for the results using both corpora. 'Tracking result' refers to the F1 score obtained using the tracking results. 'Perfect analysis' stands for the F1 achieved using the perfect (post-edited) output of the tracking method, i.e., the errors in both tracking and shift detection were corrected. Precisely, the documents judged as Yes but not evaluated as Yes were eliminated, and the documents judged as No but evaluated as Yes were added. Further, the documents were divided by a human into several sets, each of which covers a different subject related to the same event. The evaluation was made by three humans, and a classification is determined to be correct if the majority of the three judges agree. Table 4 shows that the average F1 of 'Tracking results' (0.68) in 'Total' was 0.06 lower than that of 'Perfect analysis' (0.74). Overall, the results using 'CNN' were better than those using 'Reuters'. One reason lies in the difference between the two corpora: CNN has a larger number of words per paragraph than Reuters. This yields a high recall rate, since a paragraph consisting of many words is more likely to include event, subject-class, and subject words than a paragraph containing few words.

Recall that in SVMs the weight $w_k$ of each word is calculated using formula (2), and the larger the value of $w_k$, the more the word $w_k$ features positive examples. Table 5 illustrates sample words with the highest weighted values in each classifier, using both corpora; these words are used to determine whether $x_p$ is a key paragraph or not. The event is the Kobe Japan quake, and the document which includes $x_p$ states that the death toll has risen to over 800 in the Kobe-Osaka earthquake and that officials are concentrating on getting people out. We assume these words are subject, subject-class and event words, although some words such as 'earthquake' and 'activity' appear in more than one classifier.

Table 5:

  classifier        words
  sbj(x_p)          earthquake, activity, Japan, seismologist, news conference, living, prime minister, Murayama, crew, Bill Dorman
  sbj_class(x_p)    city, something, floor, quake, Tokyo, aftershock, activity, street, injury, fire, seismologist, police, people, building, cry
  event(x_p)        Kobe, magnitude, survivor, earthquake, collapse, death, fire, damage, aftershock, Kyoto, toll, quake, magnitude, emergency, Osaka-Kobe, Japan, Osaka

Figure 4 shows how the number of documents in an event influences extraction accuracy when $N_t$ is 16. The event is the USAir 427 crash, whose F1 of 0.68 is lower than the average F1 of all events (0.79). 'P ana of tracking' refers to the results using the post-edited output of the tracking, i.e., only the tracking errors were corrected, while 'Perfect analysis' refers to the results where the errors in both tracking and shift detection were corrected. Figure 4 shows that our method does not depend on the number of documents, since the performance does not monotonically decrease as the number of documents increases. Figure 4 also shows that there is no significant difference between 'P ana of tracking' and 'Perfect analysis' compared to the difference between 'Tracking results' and 'Perfect analysis'. This indicates that (i) subject shifts are correctly detected, and (ii) the performance of our paragraph extraction depends directly on the tracking results.

We also note the contribution of shift detection to paragraph extraction. Figures 5 and 6 illustrate recall and precision for two methods: with and without shift detection. In the method without shift detection, we use the 'full memory' approach for tracking, i.e., SVMs generate the classification model from all previously seen documents, and from the tracking results we extract the paragraphs for which $sbj(x_p) = 1$ holds. We can see from both Figure 5 and Figure 6 that the method with shift detection outperformed the method without it for all $N_t$ values. More surprisingly, Figure 6 shows that the precision scores for all $N_t$ values using the tracking results with shift detection were higher than those of 'P ana' without shift detection. Further, the difference in precision between the two methods is larger than the difference in recall. This demonstrates that it is necessary to detect subject shifts, and thus to identify subject-class words, for paragraph extraction, since the system without shift detection extracts many documents, which yields redundancy.
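As an illustration of reading feature words off formula (2), the following sketch (our own; the function and data layout are assumptions, not the paper's code) trains a linear SVM and returns the words with the largest weights, i.e., the words that most strongly feature the positive examples, as in Table 5:

```python
from sklearn.svm import SVC

def top_weighted_words(X, y, vocabulary, k=10):
    # Train a linear SVM and return the k words with the largest weights in w.
    # For a linear kernel, w = sum_i alpha_i y_i x_i (formula (2)), and each
    # coordinate w_k corresponds to one word of the vocabulary.
    clf = SVC(kernel="linear").fit(X, y)
    weights = clf.coef_[0]
    ranked = sorted(zip(vocabulary, weights), key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:k]]
```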
6 Related Work

Most of the work on summarization by paragraph or sentence extraction has applied statistical techniques based on word distributions to the target document (Kupiec et al., 1995). More recently, other approaches have investigated the use of machine learning to find patterns in documents (Strzalkowski et al., 1998) and the utility of parameterized modules to deal with different genres or corpora (Goldstein et al., 2000). Some of these approaches to single-document summarization have been extended to deal with multi-document summarization (Mani and Bloedorn, 1997; Barzilay et al., 1999; McKeown et al., 1999).

Our work differs from the earlier work in several important respects. First, our method focuses on subject shifts within the documents of a single target event rather than on sets of documents from different events (Radev et al., 2000). Detecting subject shifts in the documents of one target event, however, presents special difficulties, since these documents are collected from a very restricted domain. We therefore presented a window adjustment algorithm which automatically adjusts the window over the training documents so as to include only the data which are sufficiently related to the current subject. Second, our approach works online, while many approaches are static, i.e., they use documents which are prepared in advance and apply a variety of techniques to create summaries from them. We are interested in a substantially smaller number of initial training documents, which are then utilized to extract paragraphs from documents relevant to the initial ones, because a small number of initial training documents is easy to collect, and costly human intervention can be avoided. To do this, we use a tracking technique. The small size of the training corpus, however, requires sophisticated parameter tuning for the learning technique, since we cannot construct validation sets from the initial training documents, which would be required for optimal results. Instead, we use the $E'_{loo}$ bound of SVMs to cope with this problem. Further, our method does not use task-specific features for training, such as 'Presence and type of agent' and 'Presence of citation' (Teufel, 2001), which makes it extendable to other domains.