<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1093">
  <Title>Summarizing Encyclopedic Term Descriptions on the Web</Title>
  <Section position="2" start_page="0" end_page="2" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Term descriptions, which have been carefully organized in hand-crafted encyclopedias, are valuable linguistic knowledge for both human use and computational linguistics research. However, due to the limitations of manual compilation, existing encyclopedias often lack new terms and new definitions for existing terms.</Paragraph>
    <Paragraph position="1"> The World Wide Web (the Web), which contains an enormous volume of up-to-date information, is a promising source to obtain new term descriptions.</Paragraph>
    <Paragraph position="2"> It has become fairly common to consult the Web for descriptions of a specific term. However, the use of existing search engines is associated with the following problems: (a) search engines often retrieve extraneous pages that do not describe the submitted term, (b) even if the desired pages are retrieved, a user has to identify the page fragments that describe the term, (c) word senses are not distinguished for polysemous terms, such as &quot;hub (device and center)&quot;, and (d) descriptions in multiple pages are independent and do not form a condensed and coherent text as in existing encyclopedias.</Paragraph>
    <Paragraph position="3"> The authors of this paper have been resolving these problems progressively. For problems (a) and (b), Fujii and Ishikawa (2000) proposed an automatic method to extract term descriptions from the Web. For problem (c), Fujii and Ishikawa (2001) improved the previous method, so that the multiple descriptions extracted for a single term are categorized into domains and consequently word senses are distinguished.</Paragraph>
    <Paragraph position="4"> Using these methods, we have compiled an encyclopedic corpus for approximately 600,000 Japanese terms. We have also built a Web site called &quot;Cyclone&quot; to utilize this corpus, in which one or more paragraph-style descriptions extracted from different pages can be retrieved in response to a user input. In Figure 1, three paragraphs describing &quot;XML&quot; are presented with the titles of their source pages.</Paragraph>
    <Paragraph position="5"> However, the above-mentioned problem (d) remains unresolved and this is exactly what we intend to address in this paper.</Paragraph>
    <Paragraph position="6"> In hand-crafted encyclopedias, a single term is described concisely from different &quot;viewpoints&quot;, such as the definition, exemplification, and purpose. In contrast, if the first paragraph in Figure 1 does not describe XML from a sufficient number of viewpoints, a user has to read the remaining paragraphs.</Paragraph>
    <Paragraph position="7"> However, this is inefficient, because the descriptions are extracted from independent pages and usually include redundant content.</Paragraph>
    <Paragraph position="8"> To resolve this problem, we propose a summarization method that produces a concise and condensed term description from multiple paragraphs. As a result, a user can obtain sufficient information about a term at minimal cost. Additionally, by reducing the size of descriptions, Cyclone can be used with mobile devices, such as PDAs.</Paragraph>
    <Paragraph position="9"> However, while Cyclone includes various types of terms, such as technical terms, events, and animals, the required set of viewpoints can vary depending on the type of target term. For example, the definition and exemplification are necessary for technical terms, but the family and habitat are necessary for animals. In this paper, we target Japanese technical terms in the computer domain.</Paragraph>
    <Paragraph position="10"> Section 2 outlines Cyclone. Sections 3 and 4 explain our summarization method and its evaluation, respectively. In Section 5, we discuss related work and the scalability of our method.</Paragraph>
    <Paragraph position="11"> Figure 2 depicts the overall design of Cyclone, which produces an encyclopedic corpus by means of five modules: &quot;term recognition&quot;, &quot;extraction&quot;, &quot;retrieval&quot;, &quot;organization&quot;, and &quot;related term extraction&quot;. While Cyclone produces a corpus off-line, users search the resultant corpus for specific descriptions on-line.</Paragraph>
    <Paragraph position="12"> It should be noted that the summarization method proposed in this paper is not included in Figure 2 and that the concept of viewpoint has not been used in the modules in Figure 2.</Paragraph>
    <Paragraph position="13"> In the off-line process, the input terms can be either submitted manually or collected by the term recognition module automatically. The term recognition module periodically searches the Web for morpheme sequences not included in the corpus, which are used as input terms.</Paragraph>
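    <Paragraph> The novelty check performed by the term recognition module can be sketched roughly as follows. This is only an illustration under stated assumptions, not the module used in Cyclone: the function name candidate_terms, the max_len limit, and the use of a space to join morphemes are all hypothetical, and a real system for Japanese would rely on a morphological analyzer to produce the morpheme list.</Paragraph>

```python
def candidate_terms(morphemes: list[str], known_terms: set[str], max_len: int = 3) -> list[str]:
    """Return morpheme sequences (up to max_len long) that are not yet in the corpus.

    All names and parameters here are hypothetical; they only illustrate the
    idea of proposing sequences that the corpus does not contain as input terms.
    """
    seen: set[str] = set()
    candidates: list[str] = []
    for n in range(1, max_len + 1):
        # Slide a window of n morphemes over the sequence.
        for i in range(len(morphemes) - n + 1):
            term = " ".join(morphemes[i:i + n])
            # Keep each unknown sequence once, in order of first appearance.
            if term not in known_terms and term not in seen:
                seen.add(term)
                candidates.append(term)
    return candidates


# Both single morphemes are already headwords, so only the compound is new.
new_terms = candidate_terms(["markup", "language"], {"markup", "language"})
```

    <Paragraph> In this toy run, new_terms contains only the two-morpheme sequence, which would then be passed on as an input term.</Paragraph>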
    <Paragraph position="14"> The retrieval module exhaustively searches the Web for pages that include an input term, as existing Web search engines do.</Paragraph>
    <Paragraph position="15"> The extraction module analyzes the layout (i.e., the structure of HTML tags) of each retrieved page and identifies the paragraphs that potentially describe the target term. While promising descriptions can be extracted from pages resembling on-line dictionaries, descriptions can also be extracted from general pages.</Paragraph>
    <Paragraph position="16"> The organization module classifies the multiple paragraphs for a single term into predefined domains (e.g., computers, medicine, and sports) and sorts them according to a score. The score is computed from the reliability of the source page, determined by hyperlinks as in Google, and the linguistic validity of the paragraph, determined by a language model produced from an existing machine-readable encyclopedia. Thus, different word senses, which are often associated with different domains, can be distinguished and high-quality descriptions can be selected for each domain.</Paragraph>
    <Paragraph position="17"> Finally, the related term extraction module searches top-ranked descriptions for terms strongly related to the target term (e.g., &quot;cable&quot; and &quot;LAN&quot; for &quot;hub&quot;). Existing encyclopedias often provide related terms for each headword, which help readers understand the headword. In Cyclone, related terms can also be used as feedback terms to narrow down the user focus. However, this module is beyond the scope of this paper.</Paragraph>
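    <Paragraph> The grouping and ranking step of the organization module can be sketched as follows. This is a minimal illustration, not the implementation used in Cyclone: the Description class, its field names, the assumed [0, 1] ranges, the example domain labels, and the choice to multiply the two factors are all assumptions, since the text does not state how reliability and linguistic validity are combined.</Paragraph>

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Description:
    """One extracted paragraph for a term (all fields are illustrative)."""
    text: str
    domain: str          # e.g. "computers", "medicine", "sports"
    reliability: float   # hyperlink-based reliability of the page, assumed in [0, 1]
    validity: float      # language-model validity of the paragraph, assumed in [0, 1]


def score(d: Description) -> float:
    # Assumption: the two factors are multiplied. The text only says that
    # both contribute to the score, not how they are combined.
    return d.reliability * d.validity


def organize(descriptions: list[Description]) -> dict[str, list[Description]]:
    """Group descriptions by domain and sort each group by descending score."""
    by_domain: dict[str, list[Description]] = defaultdict(list)
    for d in descriptions:
        by_domain[d.domain].append(d)
    return {dom: sorted(ds, key=score, reverse=True) for dom, ds in by_domain.items()}


# Toy data for the polysemous term "hub"; scores and domains are made up.
hub = [
    Description("A hub is a device that connects network cables.", "computers", 0.9, 0.8),
    Description("Hubs link several LAN segments.", "computers", 0.5, 0.6),
    Description("The city is a hub of regional trade.", "economy", 0.7, 0.7),
]
ranked = organize(hub)
```

    <Paragraph> Here ranked["computers"] holds the two device-sense paragraphs with the higher-scoring one first, while the other sense is kept under a separate domain key, which mirrors how domain classification separates word senses.</Paragraph>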
  </Section>
</Paper>