<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1093"> <Title>Summarizing Encyclopedic Term Descriptions on the Web</Title>
<Section position="3" start_page="2" end_page="3" type="metho"> <SectionTitle> 3 Summarization Method </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.1 Overview </SectionTitle>
<Paragraph position="0"> Given a set of paragraph-style descriptions for a single term in a specific domain (e.g., descriptions for &quot;hub&quot; in the computer domain), our summarization method produces a concise text describing the term from different viewpoints.</Paragraph>
<Paragraph position="1"> These descriptions are obtained by the organization module in Figure 2. Thus, the related term extraction module is independent of our summarization method.</Paragraph>
<Paragraph position="2"> Our method is a form of multi-document summarization (MDS) (Mani, 2001). Because a set of input documents (in our case, the paragraphs for a single term) were written by different authors and/or at different times, the redundancy and divergence of the topics in the input are greater than in single-document summarization. Thus, recognizing the similarities and differences among multiple contents is crucial. The following two questions have to be answered: * by which language unit (e.g., words, phrases, or sentences) should two contents be compared? * by which criterion should two contents be regarded as &quot;similar&quot; or &quot;different&quot;? The answers to these questions can differ depending on the application and the type of input documents.</Paragraph>
<Paragraph position="3"> Our purpose is to include as many viewpoints as possible in a concise description. Thus, we compare two contents on a viewpoint-by-viewpoint basis. In addition, if two contents are associated with the same viewpoint, we determine that those contents are similar and that they should not be repeated in the summary.</Paragraph>
<Paragraph position="4"> Our viewpoint-based summarization (VBS) method consists of the following four steps (see the pipeline sketch after this subsection): 1. identification, which recognizes the language unit associated with a viewpoint, 2. classification, which merges the identified units associated with the same viewpoint into a single group, 3. selection, which determines one or more representative units for each group, 4. presentation, which produces a summary in a specific format.</Paragraph>
<Paragraph position="5"> This model is similar to those of existing MDS methods. However, the implementation of each step varies depending on the application. We elaborate on the four steps in Sections 3.2-3.5, respectively.</Paragraph>
</Section>
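The four steps above form a simple pipeline. The following is a minimal sketch in Python; the paper does not publish an implementation, so the step functions are taken as parameters and all names here are illustrative assumptions rather than the authors' actual code.

```python
# A minimal sketch of the four-step VBS pipeline in Section 3.1.
# The step functions are parameters because the paper publishes no
# implementation; Sections 3.2-3.5 describe what each step does.

def summarize(paragraphs, identify, classify, select, present, n_reps=1):
    units = identify(paragraphs)         # 1. identification (Section 3.2)
    groups = classify(units)             # 2. classification (Section 3.3)
    representatives = {                  # 3. selection (Section 3.4)
        viewpoint: select(group, n_reps)
        for viewpoint, group in groups.items()
    }
    return present(representatives)      # 4. presentation (Section 3.5)
```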
<Section position="2" start_page="2" end_page="3" type="sub_section"> <SectionTitle> 3.2 Identification </SectionTitle>
<Paragraph position="0"> The identification module recognizes language units, each of which describes the target term from a specific viewpoint. However, a compound or complex sentence is often associated with multiple viewpoints. The following example is an English translation of a Japanese compound sentence in a Web page: XML is an abbreviation for eXtensible Markup Language, and is a markup language. The first and second clauses describe XML from the abbreviation and definition viewpoints, respectively. It should be noted that because &quot;XML&quot; and &quot;eXtensible Markup Language&quot; are spelled out in the Roman alphabet in the original sentence, the first clause does not provide Japanese readers with a definition of XML.</Paragraph>
<Paragraph position="1"> To extract language units on a viewpoint-by-viewpoint basis, we segment Japanese sentences into simple sentences. However, sentence segmentation remains a difficult problem, and its accuracy is not 100%. First, we analyze the syntactic dependency structure of an input sentence with CaboCha (http://cl.aist-nara.ac.jp/~taku-ku/software/cabocha/). Second, we use hand-crafted rules to extract simple sentences from the dependency structure.</Paragraph>
<Paragraph position="2"> Simple sentences other than the first clause often lack a subject. To resolve this problem, zero pronoun detection and anaphora resolution can be used. However, due to the rudimentary nature of existing methods, we use hand-crafted rules to complement simple sentences with the missing subject.</Paragraph>
<Paragraph position="3"> As a result, we obtain the following two simple sentences from the above input sentence, in which the complemented subject is in parentheses: XML is an abbreviation for eXtensible Markup Language. (XML) is a markup language.</Paragraph>
</Section>
<Section position="3" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 3.3 Classification </SectionTitle>
<Paragraph position="0"> The classification module merges the simple sentences related to the same viewpoint into a single group. An existing encyclopedia for technical terms uses approximately 30 obligatory and optional viewpoints. We selected the following 12 viewpoints, for which typical expressions can be coded manually: definition, abbreviation, exemplification, purpose, synonym, reference, product, advantage, drawback, history, component, function.</Paragraph>
<Paragraph position="1"> We manually produced 36 linguistic patterns used to describe terms from a specific viewpoint. These patterns are regular expressions in which specific morphemes are generalized into parts-of-speech or a special symbol representing the target term. We use a two-stage classification method. First, the simple sentences that match a pattern are classified into the associated viewpoint group. A simple sentence that matches patterns for multiple viewpoints is classified into every possible group.</Paragraph>
<Paragraph position="2"> However, the pattern-based method fails to classify sentences that do not match any predefined pattern. Thus, in the second stage, we classify each remaining sentence into the group in which the most similar sentence has already been classified. In practice, we compute the similarity between an unclassified sentence and each of the classified sentences. The similarity between two sentences is determined by the Dice coefficient, i.e., the ratio of content words commonly included in the two sentences. The sentences left unclassified by this method are classified into the &quot;miscellaneous&quot; group. In summary, our two-stage method uses predefined linguistic patterns and word statistics.</Paragraph>
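The two-stage procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation: the single pattern shown is a hypothetical stand-in for the 36 unpublished hand-crafted patterns, and simple whitespace tokenization stands in for Japanese morphological analysis of content words.

```python
import re

# A sketch of the two-stage classification in Section 3.3. The pattern
# below is a hypothetical stand-in for the 36 hand-crafted patterns;
# TERM is the special symbol representing the target term.

PATTERNS = {
    "abbreviation": re.compile(r"TERM is an abbreviation for"),
    # ... the remaining viewpoint patterns ...
}

def dice(s1, s2):
    """Dice coefficient over word sets: 2|A intersect B| / (|A| + |B|)."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    total = len(a) + len(b)
    return 2 * len(a.intersection(b)) / total if total else 0.0

def classify(sentences, term):
    groups = {v: [] for v in PATTERNS}
    groups["miscellaneous"] = []
    unmatched = []
    # Stage 1: pattern matching. A sentence matching patterns for
    # multiple viewpoints enters every possible group.
    for s in sentences:
        normalized = s.replace(term, "TERM")
        hits = [v for v, p in PATTERNS.items() if p.search(normalized)]
        for v in hits:
            groups[v].append(s)
        if not hits:
            unmatched.append(s)
    # Stage 2: attach each remaining sentence to the group containing
    # its most similar already-classified sentence; sentences similar
    # to nothing fall into the miscellaneous group.
    for s in unmatched:
        best, best_sim = "miscellaneous", 0.0
        for v in PATTERNS:
            for m in groups[v]:
                sim = dice(s, m)
                if sim > best_sim:
                    best, best_sim = v, sim
        groups[best].append(s)
    return groups
```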
<Paragraph position="3"> The following examples are English translations of Japanese sentences extracted by the identification module:
(a) XML is an extensible markup language. - definition
(b) an abbreviation for eXtensible Markup Language - abbreviation
(c) was recommended as a standard by W3C in 1998 - history
(d) XML is an abbreviation for Extensible Markup Language - abbreviation
(e) the standard of XML was recommended by W3C - ???
Sentences (a)-(d) can be classified into a specific group on the basis of the underlined expressions. Sentence (e) does not match any predefined pattern; however, in the second stage, it can be classified into the history group, because it is most similar to sentence (c).</Paragraph>
</Section>
<Section position="4" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 3.4 Selection </SectionTitle>
<Paragraph position="0"> The selection module determines one or more representative sentences for each viewpoint group. The number of sentences selected from each group can vary depending on the desired size of the resultant summary.</Paragraph>
<Paragraph position="1"> We consider the following factors to compute a score for each sentence, and select the sentences with the highest scores in each group (see the scoring sketch after this subsection).</Paragraph>
<Paragraph position="2"> * the number of common words included (W): The representative sentences should contain many words that are common in the group. We collect word frequencies for each group, and sentences including frequent words are preferred. * the rank in Cyclone (R): As depicted in Figure 2, Cyclone sorts the retrieved paragraphs according to their plausibility as descriptions. Sentences in highly ranked paragraphs are preferred. * the number of characters included (C): To minimize the size of a summary, short sentences are preferred.</Paragraph>
<Paragraph position="3"> Because these factors differ in dimension, range, and polarity, we normalize each factor to [0,1] and compute the final score as a weighted average of the three factors. The weight of each factor was determined in a preliminary study. In brief, the relative importance of the three factors is W>R>C.</Paragraph>
<Paragraph position="4"> However, because the miscellaneous group includes various viewpoints, we use a different method for it than for the regular groups. First, we select representative sentences from the regular groups. Second, from the miscellaneous group, we select the sentence that is most dissimilar to the sentences already selected as representatives. We use the Dice-based similarity from Section 3.3 to measure the dissimilarity between two sentences. If we select more than one sentence from the miscellaneous group, the second process is repeated recursively.</Paragraph>
</Section>
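The scoring scheme can be made concrete as follows. This is a sketch under stated assumptions: the weights below are placeholders that only respect the reported ordering W>R>C, since the paper does not publish the actual values from the preliminary study, and the data layout is illustrative.

```python
from collections import Counter

# A sketch of the sentence scoring in Section 3.4. The weights are
# placeholders that only respect the reported ordering W > R > C.

W_WEIGHT, R_WEIGHT, C_WEIGHT = 0.5, 0.3, 0.2

def normalize(values, invert=False):
    """Scale values to [0,1]; invert flips polarity for factors where
    smaller is better (rank and length)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return [1.0 - v for v in scaled] if invert else scaled

def score_group(sentences, ranks):
    """sentences: one viewpoint group, each sentence a list of words;
    ranks: Cyclone rank of each sentence's source paragraph (1 = top)."""
    freq = Counter(w for s in sentences for w in s)
    w = normalize([sum(freq[t] for t in s) for s in sentences])
    r = normalize(ranks, invert=True)      # smaller rank number preferred
    c = normalize([sum(len(t) for t in s) for s in sentences],
                  invert=True)             # shorter sentences preferred
    return [W_WEIGHT * wi + R_WEIGHT * ri + C_WEIGHT * ci
            for wi, ri, ci in zip(w, r, c)]
```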
<Section position="5" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 3.5 Presentation </SectionTitle>
<Paragraph position="0"> The presentation module lists the selected sentences without any post-editing. Ideally, natural language generation is required to produce a coherent text by, for example, complementing conjunctions and generating anaphoric expressions. However, a simple list of sentences is also useful for obtaining knowledge about a target term.</Paragraph>
<Paragraph position="1"> Figure 3 depicts an example summary produced from the top 50 paragraphs for the term &quot;XML&quot;. In this figure, six viewpoint groups and the miscellaneous group were formed, and only one sentence was selected from each group. The order in which the sentences are presented was determined by the scores computed in the selection module.</Paragraph>
<Paragraph position="2"> While the source paragraphs consist of 11,224 characters, the summary consists of 397 characters, which is almost the same length as the abstract of a technical paper.</Paragraph>
<Paragraph position="3"> The following is an English translation of the sentences in Figure 3. Here, the words spelled out in the Roman alphabet in the original sentences are shown in italics.
* definition: XML is an extensible markup language (eXtensible Markup Language).
* abbreviation: an abbreviation for Extensible Markup Language (an extensible markup language).
* purpose: Because XML is a standard specification for data representation, the data defined by XML can be reused, irrespective of the upper-level application.
* advantage: XML is advantageous to developers of FileMaker Pro, which needs to receive data from the client.
* reference: XML, which has recently received much attention as the next-generation Internet standard format, and related technologies.
* miscellaneous: In XML, the tags are enclosed in &quot;&lt;&quot; and &quot;&gt;&quot;.</Paragraph>
<Paragraph position="4"> Each viewpoint label or sentence is hyperlinked to the associated group or the source paragraph, respectively, so that a user can easily obtain more information on a specific viewpoint. For example, via the reference sentence, a catalogue page of the book in question can be retrieved.</Paragraph>
<Paragraph position="5"> Although the resultant summary describes XML from multiple viewpoints, there is room for improvement. For example, the sentences classified into the definition and abbreviation viewpoints include almost the same content.</Paragraph>
</Section> </Section>
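Because the presentation module simply lists one labeled sentence per viewpoint, it can be sketched compactly. The plain-text rendering below is an illustrative simplification of our own: in Cyclone, each label and sentence is additionally hyperlinked to the viewpoint group and the source paragraph, and the paper does not specify the actual output markup.

```python
# A sketch of the presentation step in Section 3.5: list one selected
# sentence per viewpoint, prefixed by its viewpoint label, in the same
# format as the translated example above. The hyperlinks of the real
# system are omitted in this simplification.

def present(representatives):
    """representatives: {viewpoint: [selected sentences]}, assumed to be
    ordered by selection score, with the miscellaneous group last."""
    lines = []
    for viewpoint, sentences in representatives.items():
        for sentence in sentences:
            lines.append(f"* {viewpoint}: {sentence}")
    return "\n".join(lines)
```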
<Section position="4" start_page="3" end_page="5" type="metho"> <SectionTitle> 4 Evaluation </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="3" end_page="5" type="sub_section"> <SectionTitle> 4.1 Methodology </SectionTitle>
<Paragraph position="0"> Existing methods for evaluating summarization techniques can be classified into intrinsic and extrinsic approaches. In the intrinsic approach, the content of a summary is evaluated with respect to the quality of the text (e.g., coherence) and its informativeness (i.e., the extent to which important contents are included in the summary). In the extrinsic approach, the evaluation measure is the extent to which a summary improves the efficiency of a specific task (e.g., relevance judgment in text retrieval).</Paragraph>
<Paragraph position="1"> In DUC and NTCIR (http://research.nii.ac.jp/ntcir/index-en.html), both approaches have been used to evaluate summarization methods targeting newspaper articles. However, because there were no public test collections targeting term descriptions in Web pages, we produced our own test collection.</Paragraph>
<Paragraph position="2"> As the first step of our summarization research, we addressed only the intrinsic evaluation. In this paper, we focused on including as many viewpoints (i.e., contents) as possible in a summary, but did not address text coherence. Thus, we used the informativeness of a summary as the evaluation criterion. We used the following two measures, which are in a trade-off relation: the compression ratio, i.e., (# characters in the summary) / (# characters in the top 50 paragraphs), and the coverage, i.e., (# viewpoints in the summary) / (# viewpoints in the top 50 paragraphs). Here, &quot;#viewpoints&quot; denotes the number of viewpoint types. Even if a summary contains multiple sentences related to the same viewpoint, the numerator is increased by only 1.</Paragraph>
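The two measures can be computed as follows. The formulas above were reconstructed from the surrounding text (the displayed equations were lost from this copy), and the data layout below, pairing each text unit with its annotated viewpoints, is an illustrative assumption.

```python
# A sketch of the two evaluation measures in Section 4.1. Both
# denominators are computed over the top 50 paragraphs returned by
# Cyclone. Each item is assumed to be a (text, viewpoints) pair.

def compression_ratio(summary_sents, source_paragraphs):
    summary_chars = sum(len(text) for text, _ in summary_sents)
    source_chars = sum(len(text) for text, _ in source_paragraphs)
    return summary_chars / source_chars

def coverage(summary_sents, source_paragraphs):
    # Only viewpoint *types* count: repeating a viewpoint in the
    # summary does not increase the numerator.
    summary_vps = set(v for _, vps in summary_sents for v in vps)
    source_vps = set(v for _, vps in source_paragraphs for v in vps)
    return len(summary_vps.intersection(source_vps)) / len(source_vps)
```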
<Paragraph position="3"> We used 15 Japanese terms from an existing computer dictionary as test inputs. English translations of the test inputs are as follows: 10BASE-T, ASCII, SQL, XML, accumulator, assembler, binary number, crossing cable, data warehouse, macro virus, main memory unit, parallel processing, resolution, search time, thesaurus.</Paragraph>
<Paragraph position="4"> To calculate the coverage, the simple sentences in the Cyclone results have to be associated with viewpoints. To reduce the subjectivity of the evaluation, for each of the 15 terms we asked two college students (excluding the authors of this paper) to annotate each simple sentence in the top 50 paragraphs with one or more viewpoints. The two annotators performed the annotation task independently. The denominators of the compression ratio and coverage were calculated over the top 50 paragraphs.</Paragraph>
<Paragraph position="5"> During a preliminary study, the authors and annotators defined 28 viewpoints, including the 12 viewpoints targeted by our method. We also defined the following three categories, which were not considered viewpoints: * non-description, which was also used to annotate non-sentence fragments caused by errors in the identification module, * description of a word sense independent of the computer domain (e.g., &quot;hub&quot; as a center, instead of a network device), * miscellaneous.</Paragraph>
<Paragraph position="6"> It may be argued that an existing hand-crafted encyclopedia could be used as the reference summary. However, paragraphs in Cyclone often contain viewpoints not described in existing encyclopedias. Thus, we did not use existing encyclopedias in our experiments.</Paragraph>
</Section> </Section>
<Section position="5" start_page="5" end_page="5" type="metho"> <SectionTitle> 4.2 Results </SectionTitle>
<Paragraph position="0"> Table 1 shows the compression ratio and coverage for the different methods, in which &quot;#Reps&quot; and &quot;#Chars&quot; denote the number of representative sentences selected from each viewpoint group and the number of characters in a summary, respectively. We always selected five sentences from the miscellaneous group.</Paragraph>
<Paragraph position="1"> The third column denotes the compression ratio. The remaining columns denote the coverage on an annotator-by-annotator basis. The columns &quot;12 Viewpoints&quot; and &quot;28 Viewpoints&quot; denote the case in which we focused only on the 12 viewpoints targeted by our method and the case in which all 28 viewpoints were considered, respectively.</Paragraph>
<Paragraph position="2"> The columns &quot;VBS&quot; and &quot;Lead&quot; denote the coverage obtained with our viewpoint-based summarization method and with the lead method. The lead method, which has often been used as a baseline in past literature, simply extracted the top N characters from the Cyclone result, where N is the number of characters shown in the second column. In other words, the compression ratio of the VBS and lead methods was standardized, and we compared the coverage of the two methods. The compression ratio and coverage were averaged over the 15 test terms.</Paragraph>
<Paragraph position="3"> The suggestions that can be derived from Table 1 are as follows. First, in the case of &quot;#Reps=1&quot;, the average size of a summary was 616 characters, which is marginally longer than the abstract of a technical paper. In the case of &quot;#Reps=3&quot;, the average summary size was 1,309 characters, which is almost the maximum size of a single description in hand-crafted encyclopedias. A summary obtained with four sentences in each group is perhaps too long as a term description.</Paragraph>
<Paragraph position="4"> Second, the compression ratio was roughly 10%, which is fairly good performance. It may be argued that the compression ratio is exaggerated: although a smaller number of highly ranked paragraphs can potentially provide sufficient viewpoints, the top 50 paragraphs were always used to calculate the denominator of the compression ratio. We found that the top 38 paragraphs, on average, contained all the viewpoint types present in the top 50 paragraphs; thus, the remaining 12 paragraphs did not provide additional information. However, it is difficult for a user to determine when to stop reading a retrieval result. In existing evaluation workshops, such as NTCIR, the compression ratio is also calculated using the total size of the input documents.</Paragraph>
<Paragraph position="5"> Third, the VBS method outperformed the lead method in terms of coverage, except for the case of &quot;#Reps=1&quot; focusing on the 12 viewpoints by annotator B. In general, the VBS method produced more informative summaries than the lead method, irrespective of the compression ratio and the annotator.</Paragraph>
<Paragraph position="6"> It should be noted that although the VBS method targets 12 viewpoints, the sentences selected from the miscellaneous group can be related to the remaining 16 viewpoints. Thus, even when we focus on the 28 viewpoints, the coverage of the VBS method can potentially increase.</Paragraph>
<Paragraph position="7"> It should also be noted that not all viewpoints are equally important. For example, in an existing encyclopedia (Nagao et al., 1990), the definition, exemplification, and synonym are regarded as obligatory viewpoints, and the remaining viewpoints are optional. We investigated the coverage for the three obligatory viewpoints and found that while the coverage for the definition and exemplification ranged from 60% to 90%, the coverage for the synonym was 50% or less.</Paragraph>
<Paragraph position="8"> The low coverage for the synonym is partially due to the fact that synonyms are often described with parentheses. However, because parentheses are used for various purposes, it is difficult to identify only the synonyms expressed with parentheses. This problem needs to be further explored.</Paragraph>
</Section> </Paper>