<?xml version="1.0" standalone="yes"?> <Paper uid="J02-4002"> <Title>c(c) 2002 Association for Computational Linguistics Summarizing Scientific Articles: Experiments with Relevance and Rhetorical Status</Title> <Section position="7" start_page="438" end_page="440" type="concl"> <SectionTitle> 6. Discussion </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="438" end_page="438" type="sub_section"> <SectionTitle> 6.1 Contribution </SectionTitle> <Paragraph position="0"> We have presented a new method for content selection from scientific articles. The analysis is genre-specific; it is based on rhetorical phenomena specific to academic writing, such as problem-solution structure, explicit intellectual attribution, and statements of relatedness to other work. The goal of the analysis is to identify the contribution of an article in relation to background material and to other specific current work.</Paragraph> <Paragraph position="1"> Our methodology is situated between text extraction methods and fact extraction (template-filling) methods: Although our analysis has the advantage of being more context-sensitive than text extraction methods, it retains the robustness of this approach toward different subdomains, presentational traditions, and writing styles.</Paragraph> <Paragraph position="2"> Like fact extraction methods (e.g., Radev and McKeown 1998), our method also uses a &quot;template&quot; whose slots are being filled during analysis. The slots of our template are defined as rhetorical categories (like &quot;Contrast&quot;) rather than by domain-specific categories (like &quot;Perpetrator&quot;). This makes it possible for our approach to deal with texts of different domains and unexpected topics.</Paragraph> <Paragraph position="3"> Sparck Jones (1999) argues that it is crucial for a summarization strategy to relate the large-scale document structure of texts to readers' tasks in the real world (i.e., to the proposed use of the summaries). We feel that incorporating a robust analysis of discourse structure into a document summarizer is one step along this way.</Paragraph> <Paragraph position="4"> Our practical contributions are twofold. First, we present a scheme for the annotation of sentences with rhetorical status, and we have shown that the annotation is stable (K = .82, .81, .76) and reproducible (K = .71). Since these results indicate that the annotation is reliable, we use it as our gold standard for evaluation and training.</Paragraph> <Paragraph position="5"> Second, we present a machine learning system for the classification of sentences by relevance and by rhetorical status. The contribution here is not the statistical classifier, which is well-known and has been used in a similar task by Kupiec, Pedersen, and Oren (1995), but instead the features we use. We have adapted 13 sentential features in such a way that they work robustly for our task (i.e., for unrestricted, real-world text). We also present three new features that detect scientific metadiscourse in a novel way. The results of an intrinsic system evaluation show that the system can identify sentences expressing the specific goal of a paper with 57% precision and 79% recall, sentences expressing criticism or contrast with 57% precision and 42% recall, and sentences expressing a continuation relationship to other work with 62% precision and 43% recall. This substantially improves a baseline of text classification which uses only a TF*IDF model over words. 
<Section position="2" start_page="438" end_page="440" type="sub_section">
<SectionTitle> 6.2 Limitations and Future Work </SectionTitle>
<Paragraph position="0"> The metadiscourse features, one focus of our work, currently depend on manual resources. The experiments reported here explore whether metadiscourse information is useful for the automatic determination of rhetorical status (as opposed to more shallow features), and this is clearly the case. The next step, however, should be the automatic creation of such resources. For the task of dialogue act disambiguation, Samuel, Carberry, and Vijay-Shanker (1999) suggest a method for automatically finding cue phrases for disambiguation. It may be possible to apply this or a similar method to our data and to compare the performance of automatically acquired resources with that of manual ones.</Paragraph>
<Paragraph position="1"> Further work can be done on the semantic verb clusters described in section 4.2.</Paragraph>
<Paragraph position="2"> Klavans and Kan (1998), who use verb clusters for document classification according to genre, observe that verb information is rarely used in current practical natural language applications. Most tasks, such as information extraction and document classification, identify and use nominal constructs instead (e.g., noun phrases, TF*IDF words and phrases).</Paragraph>
<Paragraph position="3"> The verb clusters we employ were created using our intuition of which type of verb similarity would be useful in the genre and for the task. There are good reasons for using such a hand-crafted, genre-specific verb lexicon instead of a general resource such as WordNet or Levin's (1993) classes: Many verbs used in the domain of scientific argumentation have assumed a specialized meaning, which our lexicon readily encodes. Klavans and Kan's classes, which are based on Levin's classes, are also manually created. Resnik and Diab (2000) present yet other measures of verb similarity, which could be used to arrive at a more data-driven definition of verb classes. We are currently comparing our verb clusterings to Klavans and Kan's, and to bottom-up clusters of verb similarities generated from our annotated data.</Paragraph>
<Paragraph position="4"> The recognition of agents, which is already the second-best feature in the pool, could be further improved by including named entity recognition and anaphora resolution. Named entity recognition would help in cases like the following: &quot;LHIP provides a processing method which allows selected portions of the input to be ignored or handled differently&quot; (S-5, 9408006), where LHIP is the name of the authors' approach and should thus be tagged as US AGENT. To do so, however, one would need to recognize it as a named approach that is associated with the authors. It is very likely that such a treatment, which would have to include information from elsewhere in the text, would improve results, particularly as named approaches are frequent in the computational linguistics domain. Information about named approaches would in itself also be an important aspect to include in summaries or citation indexes.</Paragraph>
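As a rough illustration of the named-approach idea just discussed, the sketch below shows how a simple pattern-based agent tagger could consult a list of approach names associated with the current authors, so that a sentence mentioning LHIP is tagged US AGENT. The patterns and the approach list are hypothetical; they are not the system's actual Agent feature.

# Hypothetical sketch: pattern-based agent tagging extended with a lexicon of
# named approaches belonging to the current authors (not the system's actual
# Agent feature).
import re

US_AGENT_PATTERNS = [r"\bwe\b", r"\bour (?:approach|method|system)\b", r"\bthis paper\b"]
THEM_AGENT_PATTERNS = [r"\btheir (?:approach|method|system)\b", r"\b[A-Z][a-z]+ et al\b"]

def classify_agent(sentence, own_named_approaches=frozenset()):
    """Return US_AGENT, THEM_AGENT, or NONE for a sentence (simplified)."""
    # A named approach introduced by the authors counts as a reference to themselves.
    for name in own_named_approaches:
        if re.search(r"\b" + re.escape(name) + r"\b", sentence):
            return "US_AGENT"
    lowered = sentence.lower()
    if any(re.search(p, lowered) for p in US_AGENT_PATTERNS):
        return "US_AGENT"
    if any(re.search(p, sentence) for p in THEM_AGENT_PATTERNS):
        return "THEM_AGENT"
    return "NONE"

print(classify_agent(
    "LHIP provides a processing method which allows selected portions "
    "of the input to be ignored or handled differently.",
    own_named_approaches={"LHIP"},
))  # US_AGENT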
<Paragraph position="5"> Anaphora resolution helps in cases in which the agent is syntactically ambiguous between own and other approaches (e.g., &quot;this system&quot;). To test whether and how much performance would improve, we manually simulated anaphora resolution on the 632 occurrences of REF AGENT in the development corpus. (In the experiments in section 5 these occurrences had been excluded from the Agent feature by giving them the value None; we now include them in their disambiguated state.) Of the 632 REF AGENTs, 436 (69%) were classified as US AGENT, 175 (28%) as THEM AGENT, and 20 (3%) as GENERAL AGENT. As a result of this manual disambiguation, the performance of the Agent feature increased dramatically from K = .08 to K = .14 and that of SegAgent from K = .19 to K = .22. This is a clear indication of the potential added value of anaphora resolution for our task.</Paragraph>
<Paragraph position="6"> As far as the statistical classification is concerned, our results are still far from perfect. Obvious ways of improving performance are the use of a more sophisticated statistical classifier and more training material. We have experimented with a maximum entropy model, Repeated Incremental Pruning to Produce Error Reduction (RIPPER), and decision trees; preliminary results do not show significant improvement over the naïve Bayesian model. One problem is that 4% of the sentences in our current annotated material are ambiguous: They receive the same feature representation but are classified differently by the annotators. A possible solution is to find better and more distinctive features; we believe that robust, higher-level features like actions and agents are a step in the right direction. We also suspect that a big improvement could be achieved with smaller annotation units. Many errors come from instances in which one half of a sentence serves one rhetorical purpose and the other half another, as in the following example: &quot;The current paper shows how to implement this general notion, without following Krifka's analysis in detail&quot; (S-10, 9411019). Here, the first part describes the paper's research goal, whereas the second expresses a contrast. Currently, one target category needs to be associated with the whole sentence (according to a rule in the guidelines, AIM is given preference over CONTRAST). As an undesired side effect, the CONTRAST-like textual parts (and the features associated with this text piece, e.g., the presence of an author's name) are wrongly associated with the AIM target category. If we allowed for a smaller annotation unit (e.g., at the clause level), this systematic noise in the training data could be removed.</Paragraph>
<Paragraph position="7"> Another improvement in classification accuracy might be achieved by performing the classification in a cascading way. The system could first perform a classification into OWN-like classes (OWN, AIM, and TEXTUAL pooled), OTHER-like categories (OTHER, CONTRAST, and BASIS pooled), and BACKGROUND, similar to the way human annotation proceeds. Subclassification among these classes would then lead to the final seven-way classification.</Paragraph>
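As a rough sketch of this cascaded setup, the following code first learns the three-way pooled distinction and then subclassifies within each pool. It assumes a base classifier exposing train() and classify(), such as the naive Bayes sketch in section 6.1 above; the pooling follows the grouping described in the text, and everything else is illustrative rather than the system's actual design.

# Illustrative sketch of cascaded classification: stage 1 assigns a pooled
# group, stage 2 resolves the fine-grained category within that group.

POOLS = {
    "OWN_LIKE": ["OWN", "AIM", "TEXTUAL"],
    "OTHER_LIKE": ["OTHER", "CONTRAST", "BASIS"],
    "BACKGROUND": ["BACKGROUND"],
}
CATEGORY_TO_POOL = {cat: pool for pool, cats in POOLS.items() for cat in cats}

class CascadedClassifier:
    def __init__(self, make_classifier):
        # make_classifier: factory returning an object with train() and classify(),
        # e.g. the NaiveBayesRhetoricalClassifier sketched in section 6.1.
        self.stage1 = make_classifier()
        self.stage2 = {pool: make_classifier() for pool in POOLS}

    def train(self, examples):
        """examples: list of (feature_dict, category) pairs."""
        data = list(examples)
        # Stage 1: learn the three-way pooled distinction.
        self.stage1.train([(feats, CATEGORY_TO_POOL[cat]) for feats, cat in data])
        # Stage 2: within each pool, learn the fine-grained categories.
        for pool in POOLS:
            self.stage2[pool].train(
                [(feats, cat) for feats, cat in data if CATEGORY_TO_POOL[cat] == pool]
            )

    def classify(self, features):
        pool = self.stage1.classify(features)
        if len(POOLS[pool]) == 1:   # BACKGROUND needs no second stage
            return POOLS[pool][0]
        return self.stage2[pool].classify(features)

</Section>
</Section>
</Paper>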