<?xml version="1.0" standalone="yes"?> <Paper uid="H01-1053"> <Title>Monitoring the News: a TDT demonstration system</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3. DEMONSTRATION DATA </SectionTitle> <Paragraph position="0"> The data for this demonstration was taken from our TDT 2000 evaluation output on the TDT cluster detection task [8]. The system is running on the TDT-3 evaluation collection of news articles, approximately 40,000 news stories spanning October 1 through December 31, 1998.</Paragraph> <Paragraph position="1"> We simulated incremental arrival of the data as follows. At the end of each day in the collection, we looked at the incremental output of the TDT detection system. At this point, every story has been classified into a cluster. Every story seen to date is in one of the clusters for that day, even if the cluster has the same contents as it did yesterday.</Paragraph> <Paragraph position="2"> The demonstration is designed to support text summarization tools that could help a user understand the content of the cluster. For our purposes, each cluster was analyzed to construct the following: 1. The title was generated by selecting the 10 most commonly occurring non-stopwords throughout the cluster. A better title would probably be the headline of the most &quot;representative&quot; news story, though this is an open research question. 2. The summary was generated by selecting the five sentences that were most representative of the entire cluster. Better approaches might generate a summary from the multiple documents [9] or summarize the changes from the previous day [5, 2].</Paragraph> <Paragraph position="3"> 3. The contents of the cluster are just a list of every story in the cluster, presented in reverse chronological order. 
Various alternative presentations are possible, including leveraging the multimedia (radio and television) that is the basis for the TDT data.</Paragraph> <Paragraph position="4"> The demonstration system was set up so that it could move between the days. All of the input to the client was generated automatically, but we saved the information so that it could be shown more quickly. It typically takes a few minutes to generate all of the presentation information for a single day's clusters.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4. DEMONSTRATION SYSTEM </SectionTitle> <Paragraph position="0"> Figure 1 shows the client window. This snapshot shows the system on October 31 at 10:00pm, approximately four weeks into the data. The status line on the lower-left shows that at this point the system has already encountered almost 16,000 stories and has broken them into about 2400 topic clusters.</Paragraph> <Paragraph position="1"> The system is showing the 50 topics with the largest number of stories. The ranked list (by size) starts on the upper-left, shows the first 25, and then continues in the upper-right. The &quot;title&quot; for each of those topics is generated in this case from the most common words within the cluster. Any system that does a better job of building a title for a large cluster of stories could be used to improve this capability.</Paragraph> <Paragraph position="2"> In addition to the ranked list of topics, the system computes intertopic similarities and depicts them using the spheres in the middle. If two topics are highly similar, their spheres will appear near each other in the visualization. This allows related topics to be detected quickly. 
Because the 50 largest topics are shown, the topics are more dissimilar than they would be with a wider range, but it is still possible to see, for example, that topics about the Clinton presidency are near each other (the cyan pair of spheres overlapping rank number 9, topic rank numbers 5 and 29). The spheres and the ranked list are tightly integrated, so selecting one causes the other to be highlighted.</Paragraph> <Paragraph position="3"> Topics can be assigned colors to make them easier to pick out in future sessions. In this case, the user has chosen to use the same color for a range of related topics--e.g., red for sports topics, green for weather topics, etc. The color selection is in the control of the user and is not done automatically. However, once a color is assigned to a topic, the color is &quot;sticky&quot; for future sessions. A user might choose to color a critical topic bright red so that changes to it stand out in the future.</Paragraph> <Paragraph position="4"> Figure 2 shows the same visualization, but here a summary of a selected topic is shown in a pop-up balloon. This summary was generated by selecting sentences that contained large numbers of key concepts from the topic. Any summarization of a cluster could be used here if it provided more useful information.</Paragraph> <Paragraph position="5"> To illustrate how the demonstration system shows changes in TDT clusters over time, Figure 3 shows an updated visualization for two weeks later (November 14, 1998). The topic colors are persistent from Figure 1, though one of the marked topics (&quot;Strawberry cancer colon Yankee&quot;) is no longer among the 50 largest and so does not appear.</Paragraph> <Paragraph position="6"> Most of the spheres include a small &quot;wedge&quot; of yellow in them. The wedge indicates the proportion of the topic's stories that are new (since Figure 1). 
Some topics have large numbers of new stories, and so have a large yellow slice, whereas a few have a very small number of new stories, and so have only a thin wedge. The yellow wedge can be as much as 50% of the sphere (which would represent an entirely new topic), and only covers the top of the sphere. This restriction ensures that the topic color is still visible.</Paragraph> <Paragraph position="7"> The controls at the top of the screen are for moving between queries, issuing a query, and returning the visualization to a &quot;home&quot; point. The next five controls affect the layout of the display, including allowing a 3-D display: a 3-D version of Figure 3 is shown in</Paragraph> </Section></Paper>