<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1056">
  <Title>NewsInEssence: A System For Domain-Independent, Real-Time News Clustering and Multi-Document Summarization</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. THE NEWSINESSENCE SYSTEM
</SectionTitle>
    <Paragraph position="0"> NewsInEssence's search agent, NewsTroll, runs in two phases.</Paragraph>
    <Paragraph position="1"> First, it looks for related articles by traversing links from the page containing the seed article. Using the seed article and any related articles found in this way, the agent then decides on a set of keywords for further search. In the second phase, it attempts to add to the cluster of related articles by querying the search engines of various news websites, using the keywords found in the first phase as search terms.</Paragraph>
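The text does not say how NewsTroll chooses its keywords. As a minimal sketch, assuming keywords are ranked by their frequency in the cluster weighted by inverse document frequency over some background collection, the selection step might look like the following. The names `tokenize` and `top_keywords`, the tokenizer, and the smoothed IDF formula are illustrative assumptions, not the system's documented method.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_keywords(cluster_texts, background_texts, k=5):
    """Rank cluster terms by (cluster term frequency) x (background IDF).

    Hypothetical helper: the source only says that keywords are drawn
    from the seed article and the related articles found in phase one.
    """
    n_docs = len(background_texts)
    # Document frequency: number of background documents containing each term.
    doc_freq = Counter()
    for text in background_texts:
        doc_freq.update(set(tokenize(text)))

    # Term frequency pooled over every article in the cluster.
    term_freq = Counter()
    for text in cluster_texts:
        term_freq.update(tokenize(text))

    def score(term):
        # Smoothed IDF (an assumption) so unseen terms do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
        return term_freq[term] * idf

    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(term_freq, key=lambda t: (-score(t), t))
    return ranked[:k]
```

Terms that are common in the background collection, such as function words, receive a low weight and fall out of the top of the ranking, while terms concentrated in the cluster rise to the top.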
    <Paragraph position="2"> In both phases, NewsTroll selectively follows hyperlinks with the aim of reaching pages which contain related stories and/or further hyperlinks to pages with related stories.</Paragraph>
    <Paragraph position="3"> Both general and site-specific rules help NewsTroll determine which URLs are likely to be useful. Only if NewsTroll determines that a URL is &quot;interesting&quot; will it go to the Internet to fetch the new page. A more stringent set of rules is then applied to determine whether the URL is likely to be a news story itself. If so, the similarity of its text to that of the original seed page is computed using an IDF-weighted vector measure. If the similarity is above a certain threshold, the page is considered to contain a related article and is added to the cluster. The user may use our web interface (Figure 2) to adjust the similarity threshold used in a given search.</Paragraph>
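The similarity test can be sketched as a cosine between IDF-weighted term vectors. The source names only an "IDF-weighted vector measure", so the cosine form, the `idf` lookup table, the default weight for unseen terms, and the threshold value of 0.3 are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def idf_weighted_vector(text, idf, default_idf=1.0):
    """Map text to a sparse vector of term frequency times IDF weight."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return {term: tf * idf.get(term, default_idf) for term, tf in counts.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_related(seed_text, page_text, idf, threshold=0.3):
    """Return True when the page is similar enough to join the cluster."""
    seed_vec = idf_weighted_vector(seed_text, idf)
    page_vec = idf_weighted_vector(page_text, idf)
    return cosine(seed_vec, page_vec) >= threshold
```

Raising `threshold` corresponds to the user tightening the similarity setting in the web interface: fewer pages are admitted, but those admitted share more of the seed article's distinctive vocabulary.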
    <Paragraph position="4"> Using several levels of filtering, NewsTroll is able to screen out large numbers of web pages quite efficiently. The expensive operation of testing lexical similarity is reserved for the small number of pages which NewsTroll finds interesting.</Paragraph>
    <Paragraph position="5"> Consequently, the agent can return useful results in real time.</Paragraph>
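The layered filtering described in this section can be sketched as a cascade in which cheap URL checks run first and only the survivors are fetched and scored. The rule predicates, the `fetch` and `similarity` callables, and the counter names are hypothetical stand-ins supplied by the caller; the system's actual general and site-specific rules are not given in the text.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlStats:
    """Counters for each stage of the cascade (names are assumptions)."""
    considered: int = 0
    followed: int = 0
    tested: int = 0
    retrieved: int = 0
    cluster: list = field(default_factory=list)


def run_cascade(urls, is_interesting, looks_like_story, fetch, similarity, threshold):
    """Apply cheap URL rules first and the costly similarity test last."""
    stats = CrawlStats()
    for url in urls:
        stats.considered += 1
        # Stage 1: URL-only rules, no network access needed.
        if not is_interesting(url):
            continue
        page_text = fetch(url)  # Only interesting URLs are fetched.
        stats.followed += 1
        # Stage 2: stricter rules decide whether this is a news story.
        if not looks_like_story(url):
            continue
        # Stage 3: the expensive lexical similarity test.
        stats.tested += 1
        if similarity(page_text) >= threshold:
            stats.retrieved += 1
            stats.cluster.append(url)
    return stats
```

Because each stage rejects most of its input, the number of similarity computations stays small relative to the number of links seen, which is what allows results to be returned in real time.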
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. ANNOTATED SAMPLE RUN
</SectionTitle>
    <Paragraph position="0"> The example begins when we find a news article we would like to read more about. In this case we pick a breaking story about one of President-Elect Bush's cabinet nominees (see Figure 1).</Paragraph>
    <Paragraph position="1"> We input the URL using the web interface of the NEWSINESSENCE system, then select our search options, click 'Proceed' and wait for our results (see Figure 2).</Paragraph>
    <Paragraph position="2"> In response to the user query, NewsTroll begins looking for related articles linked from the chosen start page. In a selection from the agent's output log in Figure 3, we can see that it extracts and tests links from the page, and decides to test one which looks like a news article. We then see that it tests this article and determines it to be related. This article is added to the initial cluster, from which the list of top keywords is drawn.</Paragraph>
    <Paragraph position="3"> In its secondary phase, NewsTroll inputs its keywords to the search engines of news sites and lets them do the work of finding stories. Since we have selected good keywords, most of the links seen by NewsTroll in this part of the search are indeed related articles (see Figure 4). Upon exiting, NewsTroll reports the number of links it has considered, followed, tested, and retrieved (see Figure 4).</Paragraph>
    <Paragraph position="4"> The system's web interface reports its progress to the user in real time and provides a link to the visualization GUI once the cluster is complete (Figure 5). Using the GUI, the user can select which of the articles to summarize (see Figures 6 and 7). Figure 8 shows the output of the cluster summarizer.</Paragraph>
  </Section>
</Paper>