<?xml version="1.0" standalone="yes"?> <Paper uid="P99-1078"> <Title>Using Linguistic Knowledge in Automatic Abstracting</Title> <Section position="3" start_page="0" end_page="596" type="metho"> <SectionTitle> 2 The Corpus </SectionTitle> <Paragraph position="0"> The production of professional abstracts has long been an object of study (Cremmins, 1982). In particular, it has been argued that structural parts of parent documents such as introductions and conclusions are important in order to obtain the information for the topical sentence (Endres-Niggemeyer et al., 1995). We have been investigating which kind of information is reported in professional abstracts as well as where the information lies in parent documents and how it is conveyed. In Figure 1, we show a professional abstract from the &quot;Computer and Control Abstracts&quot; journal; this kind of abstract aims to alert readers to the existence of a new article in a particular field. The example contains information about the author's interest, the author's development and the overview of the parent document. All the information reported in this abstract was found in the introduction of its parent document.</Paragraph> <Paragraph position="1"> In order to study the aforementioned aspects, we have manually aligned sentences of 100 professional abstracts with sentences of parent documents containing the information reported in the abstract. In a previous study (Saggion and Lapalme, 1998), we showed that 72% of the information in professional abstracts lies in titles, captions, first sections and last sections of parent documents while the rest of the information was found in author abstracts and other sections. These results suggest that some structural sections are particularly important in order to select information for an abstract but also that these sections alone are not enough to produce a good informative abstract (i.e. we hardly find the results of an investigation in the introduction of a research paper).</Paragraph>
<Paragraph position="2"> The production of understandable and maintainable expert systems using the current generation of multiparadigm development tools is addressed. This issue is discussed in the context of COMPASS, a large and complex expert system that helps maintain an electronic telephone exchange. As part of the work on COMPASS, several techniques to aid maintainability were developed and successfully implemented. Some of the techniques were new, others were derived from traditional software engineering but modified to fit the rapid prototyping approach of expert system building. An overview of the COMPASS project is presented, software problem areas are identified, solutions adopted in the final system are described and how these solutions can be generalized is discussed. Figure 1: Professional Abstract: CCA 58293 (1990 vol.25 no.293). Parent Document: &quot;Maintainability Techniques in Developing Large Expert Systems.&quot; D.S. Prerau et al. IEEE Expert, vol.5, no.3, p.71-80, June 1990.</Paragraph> </Section> <Section position="4" start_page="596" end_page="597" type="metho"> <SectionTitle> 3 Conceptual and Linguistic Information </SectionTitle> <Paragraph position="0"> The complex process of scientific discovery that starts with the identification of a research problem and eventually ends with an answer to the problem (Bunge, 1967) would generally be disseminated in a technical or scientific paper: a complex record of knowledge containing, among others, references to the following concepts: the author, the author's affiliation, other authors, the authors' development, the authors' interest, the research article and its components (sections, figures, tables, etc.), the problem under consideration, the authors' solution, others' solution, the topics of the research article, the motivation for the study, the importance of the study, what the author found, what the author thinks, what others have done, and so forth.</Paragraph>
<Paragraph position="1"> Those concepts are systematically selected for inclusion in professional abstracts. We have noted that some of them are lexically marked while others appear as arguments of predicates conveying specific relations in the domain of discourse. For example, in an expression such as &quot;We found significant reductions in ...&quot; the verb &quot;find&quot; takes a result as its argument, and in the expression &quot;The lack of a library severely limits the impact of...&quot; the verb &quot;limit&quot; entails a problem.</Paragraph> <Paragraph position="2"> We have used our corpus and a set of more than 50 complete technical articles in order to deduce a conceptual model and to gather lexical information conveying concepts and relations. Although our conceptual model does not deal with all the intricacies of the domain, we believe it covers most of the important information relevant for an abstract. In order to obtain linguistic expressions marking concepts and relations, we have tagged our corpus with a POS tagger (Foster, 1991) and we have used a thesaurus (Vianna, 1980) to semantically classify the lexical items (most of them are polysemous). 
Figure 2 gives an overview of some concepts, relations and lexical items so far identified.</Paragraph> <Paragraph position="3"> The information we collected allows the definition of patterns of two kinds: (i) linguistic patterns for the identification of noun groups and verb groups; and (ii) domain specific patterns for the identification of entities and relations in the conceptual model. This allows for the identification of complex noun groups such as &quot;The TIGER condition monitoring system&quot; in the sentence &quot;The TIGER gas turbine condition monitoring system addresses the performance monitoring aspects&quot; and the interpretation of strings such as &quot;University of Montreal&quot; as a reference to an institution and verb forms such as &quot;have presented&quot; as a reference to a predicate possibly introducing the topic of the document. The patterns have been specified according to the linguistic constructions found in the corpus and then expanded to cope with other valid linguistic patterns not observed in our data.</Paragraph> <Section position="1" start_page="597" end_page="597" type="sub_section"> <SectionTitle> Concepts/Relations | Explanation | Lexical Items </SectionTitle> <Paragraph position="0"> make know | The author marks the topic of the document | describe, expose, present, ...</Paragraph> <Paragraph position="1"> study | The author is engaged in study | analyze, examine, explore, ...</Paragraph> <Paragraph position="2"> express interest | The author is interested in | address, concern, interest, ...</Paragraph> <Paragraph position="3"> experiment | The author is engaged in experimentation | experiment, test, try out, ...</Paragraph> <Paragraph position="4"> identify goal | The author identifies the research goal | necessary, focus on, ...</Paragraph> <Paragraph position="5"> explain | The author gives explanations | explain, interpret, justify, ...</Paragraph> <Paragraph position="6"> define | A concept is being defined | define, be, ...</Paragraph> <Paragraph position="7"> describe | An entity is being described | compose, form, ...</Paragraph> <Paragraph position="8"> authors | The authors of the article | We, I, author, ...</Paragraph> <Paragraph position="9"> paper | The technical article | article, here, paper, study, ...</Paragraph> <Paragraph position="10"> institutions | Authors' affiliation | University, Université, ...</Paragraph> <Paragraph position="11"> other researchers | Other researchers | Proper Noun (Year), ...</Paragraph> </Section> </Section> <Section position="5" start_page="597" end_page="600" type="metho"> <SectionTitle> 4 Generating Abstracts </SectionTitle> <Paragraph position="0"> It is generally accepted that there is no such thing as an ideal abstract, but rather different kinds of abstracts for different purposes and tasks (McKeown et al., 1998). We aim at the generation of a type of abstract well recognized in the literature: short indicative-informative abstracts.</Paragraph> <Paragraph position="1"> The indicative part identifies the topics of the document (what the authors present, discuss, address, etc.) while the informative part elaborates some topics according to the reader's interest by motivating the topics, describing entities, defining concepts and so on. This kind of abstract could be used in tasks such as accessing the content of the document and deciding if the parent document is worth reading. Our method of automatic abstracting relies on: * the identification of sentences containing domain specific linguistic patterns; * the instantiation of templates using the selected sentences; * the identification of the topics of the document; and * the presentation of the information using re-generation techniques.</Paragraph> <Paragraph position="2"> The templates represent different kinds of information we have identified as important for inclusion in an abstract. 
They are classified into indicative templates, used to represent concepts and relations usually present in indicative abstracts such as &quot;the topic of the document&quot;, &quot;the structure of the document&quot;, &quot;the identification of main entities&quot;, &quot;the problem&quot;, &quot;the need for research&quot;, &quot;the identification of the solution&quot;, &quot;the development of the author&quot; and so on; and informative templates, representing concepts that appear in informative abstracts such as &quot;entity/concept definition&quot;, &quot;entity/concept description&quot;, &quot;entity/concept relevance&quot;, &quot;entity/concept function&quot;, &quot;the motivation for the work&quot;, &quot;the description of the experiments&quot;, &quot;the description of the methodology&quot;, &quot;the results&quot;, &quot;the main conclusions&quot; and so on. Associated with each template is a set of rules used to identify potential sentences which could be used to instantiate the template. For example, the rules for the &quot;topic of the document&quot; template specify a search for the make know category in the introduction and conclusion of the paper, while the rules for the &quot;entity description&quot; template specify a search for the describe category in the whole text.</Paragraph> <Paragraph position="3"> Only sentences matching specific patterns are retained in order to instantiate the templates, and this partly reduces the problem of polysemy of the lexical items.</Paragraph> <Paragraph position="4"> The overall process of automatic abstracting shown in Figure 3 is composed of the following steps: Pre-processing and Interpretation: The raw text is tagged and transformed into a structured representation allowing the following processes to access the structure of the text (words, groups of words, titles, sentences, paragraphs, sections, and so on). 
Domain specific transducers are applied in order to identify possible concepts in the discourse domain (such as the authors, the paper, references to other authors, institutions and so on) and linguistic transducers are applied in order to identify noun groups and verb groups.</Paragraph> <Paragraph position="5"> Afterwards, semantic tags marking discourse domain relations and concepts are added to the different elements of the structure.</Paragraph> <Paragraph position="6"> Additionally, the process extracts noun groups, computes noun group distribution (assigning a weight to each noun group) and generates the topical structure of the paper: a structure with n + 1 components where n is the number of sections in the document. Component i (0 ≤ i ≤ n) contains the noun groups extracted from the title of section i (0 indicates the title of the document). The structure is used in the selection of the content for the indicative abstract. Indicative Selection: Its function is to identify potential topics of the document and to construct a pool of &quot;propositions&quot; introducing the topics. The indicative templates are used to this end: sentences are selected, filtered and used to instantiate the templates using patterns identified during the analysis of the corpus. The instantiated templates obtained in this step constitute the indicative data base.</Paragraph> <Paragraph position="7"> Each template contains, in addition to its specific slots, the following: the topic candidate slot, which is filled in with the noun groups of the sentence used for instantiation; the weight slot, filled in with the sum of the weights of the noun groups in the topic candidate slot; and the position slot, filled in with the position of the sentence (section number and sentence number) which instantiated the template. 
In Figure 4, the &quot;topic of the document&quot; template appears instantiated using the sentence &quot;this paper describes the Active Telepresence System with an integrated AR system to enhance the operator's sense of presence in hazardous environments.&quot; In order to select the content for the indicative abstract the system looks for a &quot;match&quot; between the topical structure and the templates in the indicative data base: the system tries all the matches between noun groups in the topical structure and noun groups in the topic candidate slots. One template is selected for each component of the topical structure: the template with the most matches. The selected templates constitute the content of the indicative abstract and the noun groups in the topic candidate slots constitute the potential topics.</Paragraph> <Paragraph position="8"> Informative Selection: this process aims to confirm which of the potential topics computed by the indicative selection are actual topics (i.e. topics the system could informatively expand according to the reader's interest) and produces a pool of &quot;propositions&quot; elaborating the topics. All informative templates are used in this step; the process considers sentences containing the potential topics and matching informative patterns. The instantiated informative templates constitute the informative data base and the potential topics appearing in the informative templates form the topics of the document.</Paragraph> <Paragraph position="9"> Generation: This is a two-step process.</Paragraph> <Paragraph position="10"> First, in the indicative generation, the templates selected by the indicative selection are presented to the reader in a short text which contains the topics identified by the informative selection and the kind of information the user could ask for. 
Second, in the informative generation, the reader selects some of the topics, asking for specific types of information.</Paragraph> <Paragraph position="11"> The informative templates associated with the selected topics are used to present the required information to the reader using expansion operators such as the &quot;description&quot; operator, whose effect is to present the description of the selected topic. For example, if the &quot;topic of the document&quot; template (Figure 4) is selected by the informative selection the following indicative text will be presented: Describes the Active Telepresence System with an integrated AR system to enhance the operator's sense of presence in hazardous environments. Topics: Active Telepresence System (description); AR system (description); AR (definition)</Paragraph> <Paragraph position="12"> Topic of the document template. Main predicate: &quot;describes&quot;: DESCRIBE; Where: nil; Who: &quot;This paper&quot;: PAPER; What: &quot;the Active Telepresence System with an integrated AR system to enhance the operator's sense of presence in hazardous environments&quot;; Position: Number 1 from &quot;Conclusion&quot; Section; Topic candidates: &quot;the Active Telepresence System&quot;, &quot;an integrated AR system&quot;, &quot;the operator's sense&quot;, &quot;presence&quot;, &quot;hazardous environments&quot;; Weight: ... Entity description template. Main predicate: &quot;consist of&quot;: CONSIST OF; Topical entity: &quot;The Active Telepresence System&quot;; Related entities: &quot;three distinct elements&quot;, &quot;the stereo head&quot;, &quot;its controller&quot;, &quot;the display device&quot;; Position: Number 4 from &quot;The Active Telepres- [...] and virtual worlds&quot; J. Pretlove, Industrial Robot, vol.25, issue 6, 1998.</Paragraph> <Paragraph position="13"> If the reader chooses to expand the description of the topic &quot;Active Telepresence System&quot;, the following text will be presented: The Active Telepresence System consists of three distinct elements: the stereo head, its controller and the display device.</Paragraph> <Paragraph position="14"> The pre-processing and interpretation steps are currently implemented. We are testing the processes of indicative and informative selection and we are developing the generation step.</Paragraph> </Section> </Paper>