<?xml version="1.0" standalone="yes"?> <Paper uid="J02-4002"> <Title>c(c) 2002 Association for Computational Linguistics Summarizing Scientific Articles: Experiments with Relevance and Rhetorical Status</Title> <Section position="5" start_page="425" end_page="431" type="metho"> <SectionTitle> 4. The System </SectionTitle> <Paragraph position="0"> We now describe an automatic system that can perform extraction and classification of rhetorical status on unseen text (cf. also a prior version of the system reported in Teufel and Moens [2000] and Teufel [1999]). We decided to use machine learning to perform this extraction and classification, based on a variety of sentential features similar to the ones reported in the sentence extraction literature. Human annotation is used as training material such that the associations between these sentential features and the target sentences can be learned. It is also used as gold standard for intrinsic system evaluation.</Paragraph> <Paragraph position="1"> A simpler machine learning approach using only word frequency information and no other features, as typically used in tasks like text classification, could have been employed (and indeed Nanba and Okumura [1999] do so for classifying citation contexts). To test if such a simple approach would be enough, we performed a text categorization experiment, using the Rainbow implementation of a na&quot;ive Bayes term frequency times inverse document frequency (TF*IDF) method (McCallum 1997) and considering each sentence as a &quot;document.&quot; The result was a classification performance of K = .30; the classifier nearly always chooses OWN and OTHER segments. The rare but important categories AIM,BACKGROUND,CONTRAST, and BASIS could be retrieved only with low precision and recall. Therefore, text classification methods do not provide a solution to our problem. This is not surprising, given that the definition of our task has little to do with the distribution of &quot;content-bearing&quot; words and phrases, much less so than the related task of topic segmentation (Morris and Hirst 1991; Hearst 1997; Choi 2000), or Saggion and Lapalme's (2000) approach to the summarization of scientific articles, which relies on scientific concepts and their relations. Instead, we predict that other indicators apart from the simple words contained in the sentence could provide strong evidence for the modeling of rhetorical status. Also, the relatively small amount of training material we have at our disposal requires a machine learning method that makes optimal use of as many different kinds of features as possible. We predicted that this would increase precision and recall on the categories in which we are interested.</Paragraph> <Paragraph position="2"> The text classification experiment is still useful as it provides a nontrivial baseline for comparison with our intrinsic system evaluation presented in section 5.</Paragraph> <Section position="1" start_page="425" end_page="425" type="sub_section"> <SectionTitle> 4.1 Classifiers </SectionTitle> <Paragraph position="0"> Weuseana&quot;ive Bayesian model as in Kupiec, Pedersen, and Chen's (1995) experiment (cf. Figure 9). Sentential features are collected for each sentence (Table 4 gives an overview of the features we used). 
<Section position="2" start_page="425" end_page="431" type="sub_section"> <SectionTitle> 4.2 Features </SectionTitle> <Paragraph position="0"> Some of the features in our feature pool are unique to our approach, for instance, the metadiscourse features. Others are borrowed from the text extraction literature (Paice 1990) or related tasks and adapted to the problem of determining rhetorical status. (Table 4, excerpt: 6. Title: does the sentence contain words also occurring in the title or headlines? Yes or No. 7. TF*IDF: does the sentence contain &quot;significant terms&quot; as determined by the TF*IDF measure? 14. Agent: type of agent; 9 Agent Types or None. 15. SegAgent: type of agent; 9 Agent Types or None. 16. Action: type of action, with or without negation; 27 Action Types or None.)</Paragraph> <Paragraph position="1"> Absolute location of a sentence. In the news domain, sentence location is the single most important feature for sentence selection (Brandow, Mitze, and Rau 1995); in our domain, location information, although less dominant, can still give a useful indication.</Paragraph> <Paragraph position="2"> Rhetorical zones appear in typical positions in the article, as scientific argumentation follows certain patterns (Swales 1990). For example, limitations of the author's own method can be expected to be found toward the end of the article, whereas limitations of other researchers' work are often discussed in the introduction. We observed that the size of rhetorical zones depends on location, with smaller rhetorical zones occurring toward the beginning and the end of the article. We model this by assigning location values in the following fashion: The article is divided into 20 equal parts, counting sentences. Sentences occurring in parts 1, 2, 3, 4, 19, and 20 receive the values A, B, C, D, I, and J, respectively. Parts 5 and 6 are pooled, and sentences occurring in them are given the value E; the same procedure is applied to parts 15 and 16 (value G) and 17 and 18 (value H). The remaining sentences in the middle (parts 7-14) all receive the value F (cf. Figure 10).</Paragraph> <Paragraph position="3"> Section structure. Sections can have an internal structuring; for instance, sentences toward the beginning of a section often have a summarizing function. The section location feature divides each section into three parts and assigns seven values: first sentence, last sentence, second or third sentence, second-last or third-last sentence, or else either somewhere in the first, second, or last third of the section. Paragraph structure. In many genres, paragraphs also have internal structure (Wiebe 1994), with high-level or summarizing sentences occurring more often at the periphery of paragraphs. In this feature, sentences are distinguished into those leading or ending a paragraph and all others.</Paragraph>
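The positional features just described can be computed directly from sentence, section, and paragraph indices. The sketch below derives the A-J location value from the 20-part scheme, a seven-valued section-structure label, and the paragraph-structure value; the function names and the handling of boundary cases are our own illustrative choices, not the authors' code.

def location_value(sent_index, n_sents):
    """Map a 0-based sentence index to one of the values A-J."""
    part = min(20, (sent_index * 20) // n_sents + 1)   # which of 20 equal parts, 1-based
    if part <= 4:
        return "ABCD"[part - 1]          # parts 1-4 map to A, B, C, D
    if part in (5, 6):
        return "E"                       # parts 5-6 pooled
    if part <= 14:
        return "F"                       # parts 7-14 pooled (middle of the article)
    if part in (15, 16):
        return "G"
    if part in (17, 18):
        return "H"
    return "I" if part == 19 else "J"    # parts 19 and 20

def section_structure_value(pos_in_section, section_len):
    """Seven values describing a sentence's position within its section."""
    last = section_len - 1
    if pos_in_section == 0:
        return "FIRST"
    if pos_in_section == last:
        return "LAST"
    if pos_in_section in (1, 2):
        return "SECOND_OR_THIRD"
    if pos_in_section in (last - 1, last - 2):
        return "SECOND_OR_THIRD_LAST"
    third = pos_in_section / section_len
    return "FIRST_THIRD" if third < 1/3 else ("SECOND_THIRD" if third < 2/3 else "LAST_THIRD")

def paragraph_structure_value(pos_in_para, para_len):
    # peripheral sentences (leading or ending a paragraph) versus all others
    return "PERIPHERY" if pos_in_para in (0, para_len - 1) else "INSIDE"

print(location_value(0, 200), location_value(120, 200), location_value(199, 200))  # A F J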
<Paragraph position="4"> Headlines. Prototypical headlines can be an important predictor of the rhetorical status of sentences occurring in the given section; however, not all texts in our collection use such headlines. Whenever a prototypical headline is recognized (using a set of regular expressions), it is classified into one of the following 15 classes: Introduction, Implementation, Example, Conclusion, Result, Evaluation, Solution, Experiment, Discussion, Method, Problems, Related Work, Data, Further Work, Problem Statement. If none of the patterns match, the value Non-Prototypical is assigned.</Paragraph> <Paragraph position="5"> Sentence length. Kupiec, Pedersen, and Chen (1995) report sentence length as a useful feature for text extraction. In our implementation, sentences are divided into long or short sentences, by comparison to a fixed threshold (12 words).</Paragraph> <Paragraph position="6"> Title word contents. Sentences containing many &quot;content-bearing&quot; words have been hypothesized to be good candidates for text extraction. Baxendale (1958) extracted all words except those on the stop list from the title and the headlines and determined for each sentence whether or not it contained these words. We received better results by excluding headline words and using only title words.</Paragraph> <Paragraph position="7"> TF*IDF word contents. How content-bearing a word is can also be measured with frequency counts (Salton and McGill 1983). The TF*IDF formula assigns high values to words that occur frequently in one document, but rarely in the overall collection of documents. We use the 18 highest-scoring TF*IDF words and classify sentences into those that contain one or more of these words and those that do not.</Paragraph>
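A minimal version of this TF*IDF sentence feature might look as follows; the tokenization, the smoothed IDF variant, and the toy document collection are simplifying assumptions rather than the implementation used in the paper.

import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def top_tfidf_terms(doc, collection, k=18):
    # score terms of the current document against the whole article collection
    tf = Counter(tokenize(doc))
    df = Counter()
    for other in collection:
        df.update(set(tokenize(other)))
    n_docs = len(collection)
    scores = {w: tf[w] * (math.log((1 + n_docs) / (1 + df[w])) + 1.0) for w in tf}
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def tfidf_feature(sentence, top_terms):
    # the feature value is binary: does the sentence contain a significant term?
    return "Yes" if set(tokenize(sentence)) & top_terms else "No"

collection = ["we present a clustering method for nouns ...",
              "parsing with unification grammars is described ..."]
doc = collection[0]
terms = top_tfidf_terms(doc, collection)
print(tfidf_feature("We present a new clustering method.", terms))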
<Paragraph position="8"> Verb syntax. Linguistic features like tense and voice often correlate with rhetorical zones; Biber (1995) and Riley (1991) show correlation of tense and voice with prototypical section structure (&quot;method,&quot; &quot;introduction&quot;). In addition, the presence or absence of a modal auxiliary might be relevant for detecting the phenomenon of &quot;hedging&quot; (i.e., statements in which an author distances herself from her claims or signals low certainty: these results might indicate that ... possibly ... [Hyland 1998]). For each sentence, we use part-of-speech-based heuristics to determine tense, voice, and presence of modal auxiliaries. This algorithm is shared with the metadiscourse features, and the details are described below.</Paragraph> <Paragraph position="9"> Citation. There are many connections between citation behavior and relevance or rhetorical status. First, if a sentence contains a formal citation or the name of another author mentioned in the bibliography, it is far more likely to talk about other work than about own work. Second, if it contains a self-citation, it is far more likely to contain a direct statement of continuation (25%) than a criticism (3%). Third, the importance of a citation has been related to the distinction between authorial and parenthetical citations. Citations are called authorial if they form a syntactically integral part of the sentence or parenthetical if they do not (Swales 1990). In most cases, authorial citations are used as the subject of a sentence, and parenthetical ones appear toward the middle or the end of the sentence.</Paragraph> <Paragraph position="10"> We built a recognizer for formal citations. It parses the reference list at the end of the article and determines whether a citation is a self-citation (i.e., if there is an overlap between the names of the cited researchers and the authors of the current paper), and it also finds occurrences of authors' names in running text, but outside of formal citation contexts (e.g., Chomsky also claims that ...). The citation feature reports whether a sentence contains an author name, a citation, or nothing. If it contains a citation, the value records whether it is a self-citation and also records the location of the citation in the sentence (in the beginning, the middle, or the end). This last distinction is a heuristic for the authorial/parenthetical distinction. We also experimented with including the number of different citations in a sentence, but this did not improve results.</Paragraph> <Paragraph position="11"> History. As there are typical patterns in the rhetorical zones (e.g., AIM sentences tend to follow CONTRAST sentences), we wanted to include the category assigned to the previous sentence as one of the features. In unseen text, however, the previous target is unknown at training time (it is determined during testing). It can, however, be calculated as a second-pass process during training. In order to avoid a full Viterbi search of all possibilities, we perform a beam search with width of three among the candidates of the previous sentence, following Barzilay et al. (2000).</Paragraph> <Paragraph position="12"> Formulaic expressions. We now turn to the last three features in our feature pool, the metadiscourse features, which are more sophisticated than the other features. The first metadiscourse feature models formulaic expressions like the ones described by Swales, as they are semantic indicators that we expect to be helpful for rhetorical classification. We use a list of phrases described by regular expressions, similar to Paice's (1990) grammar. Our list is divided into 18 semantic classes (cf. Table 5), comprising a total of 644 patterns. The fact that phrases are clustered is a simple way of dealing with data sparseness. In fact, our experiments in section 5.1.2 will show the usefulness of our (manual) semantic clusters: The clustered list performs much better than the unclustered list (i.e., when the string itself is used as a value instead of its semantic class).</Paragraph>
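As an illustration of how such clustered formulaic patterns can be matched, here is a small sketch; the two classes and their regular expressions are invented for illustration and are far smaller than the 18 classes and 644 patterns described above.

import re

FORMULAIC_CLASSES = {
    "GAP_INTRODUCTION": [
        r"\bto our knowledge\b",
        r"\bas far as we (know|are aware)\b",
        r"\bhas not (yet )?been (addressed|studied)\b",
    ],
    "US_PREVIOUS": [
        r"\bin (a )?previous (paper|work)\b",
        r"\bwe have (previously|earlier) (shown|reported)\b",
    ],
}

def formulaic_feature(sentence):
    s = sentence.lower()
    for sem_class, patterns in FORMULAIC_CLASSES.items():
        if any(re.search(p, s) for p in patterns):
            return sem_class        # the clustered class, not the raw matched string
    return "None"

print(formulaic_feature("To our knowledge, this problem has not been addressed."))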
<Paragraph position="13"> Agent. Agents and actions are more challenging to recognize. We use a mechanism that, dependent on the voice of a sentence, recognizes agents (subjects or prepositional phrases headed by by) and their predicates (&quot;actions&quot;). Classification of agents and actions relies on a manually created lexicon of semantic classes. As in the Formulaic feature, similar agents and actions are generalized and clustered together to avoid data sparseness.</Paragraph> <Paragraph position="14"> The lexicon for agent patterns (cf. Table 6) contains 13 types of agents and a total of 167 patterns. These 167 patterns expand to many more strings as we use a replace mechanism (e.g., the placeholder WORK NOUN in the sixth row of Table 6 can be replaced by a set of 37 nouns including theory, method, prototype, algorithm).</Paragraph> <Paragraph position="15"> The main three agent types we distinguish are US AGENT, THEM AGENT, and GENERAL AGENT, following the types of intellectual attribution discussed above. A fourth type is US PREVIOUS AGENT (the authors, but in a previous paper).</Paragraph> <Paragraph position="16"> Additional agent types include nonpersonal agents like aims, problems, solutions, absence of solution, or textual segments. There are four equivalence classes of agents with ambiguous reference (&quot;this system&quot;): REF AGENT, REF US AGENT, THEM PRONOUN AGENT, and AIM REF AGENT.</Paragraph> <Paragraph position="17"> Agent classes were created based on intuition, but subsequently each class was tested with corpus statistics to determine whether it should be removed or not. We wanted to find and exclude classes that had a distribution very similar to the overall distribution of the target categories, as such features are not distinctive. We measured associations using the log-likelihood measure (Dunning 1993) for each combination of target category and semantic class by converting each cell of the contingency table into a 2x2 contingency table. We kept only classes in which at least one category showed a high association (gscore > 5.0), as that means that in these cases the distribution was significantly different from the overall distribution. The last column in Table 6 shows that the classes THEM PRONOUN, GENERAL, SOLUTION, PROBLEM, and REF were removed; removal improved the performance of the Agent feature.</Paragraph>
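The pruning test can be written down compactly. The sketch below builds the 2x2 contingency table for one (semantic class, target category) combination and computes Dunning's log-likelihood statistic; the counts are hypothetical, and a class is kept if at least one category exceeds the 5.0 threshold.

import math

def g2(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G2) for a 2x2 contingency table."""
    def xlogx(x):
        return x * math.log(x) if x > 0 else 0.0
    n = k11 + k12 + k21 + k22
    return 2 * (xlogx(k11) + xlogx(k12) + xlogx(k21) + xlogx(k22)
                - xlogx(k11 + k12) - xlogx(k21 + k22)
                - xlogx(k11 + k21) - xlogx(k12 + k22)
                + xlogx(n))

# k11: sentences where this semantic class co-occurs with this target category;
# k12: class present, other categories; k21: class absent, this category;
# k22: class absent, other categories. The counts below are hypothetical.
score = g2(40, 160, 360, 7440)
print(round(score, 2), score > 5.0)   # keep the class if some category exceeds 5.0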
<Paragraph position="18"> SegAgent. SegAgent is a variant of the Agent feature that keeps track of previously recognized agents; unmarked sentences receive these previous agents as a value (in the Agent feature, they would have received the value None).</Paragraph> <Paragraph position="19"> Action. We use a manually created action lexicon containing 365 verbs (cf. Table 7). The verbs are clustered into 20 classes based on semantic concepts such as similarity, contrast, competition, presentation, argumentation, and textual structure. For example, PRESENTATION ACTIONs include communication verbs like present, report, and state (Myers 1992; Thompson and Yiyun 1991), RESEARCH ACTIONS include analyze, conduct, define and observe, and ARGUMENTATION ACTIONS include argue, disagree, and object to. Domain-specific actions are contained in the classes indicating a problem (fail, degrade, waste, overestimate) and solution-contributing actions (circumvent, solve, mitigate). The recognition of negation is essential; the semantics of not solving is closer to being problematic than it is to solving.</Paragraph>
Table 7 The action lexicon: 20 classes, 365 verbs in total. Each line gives the class, an example, and the number of verbs; X marks a class removed by the gscore test, + a class whose positive context was removed, and − a class whose negative context was removed.
AFFECT (we hope to improve our results): 9, X
ARGUMENTATION (we argue against a model of): 19, X
AWARENESS (we are not aware of attempts): 5, +
BETTER SOLUTION (our system outperforms ...): 9, −
CHANGE (we extend CITE's algorithm): 23
COMPARISON (we tested our system against ...): 4
CONTINUATION (we follow CITE ...): 13
CONTRAST (our approach differs from ...): 12, −
FUTURE INTEREST (we intend to improve ...): 4, X
INTEREST (we are concerned with ...): 28
NEED (this approach, however, lacks ...): 8, X
PRESENTATION (we present here a method for ...): 19, −
PROBLEM (this approach fails ...): 61, −
RESEARCH (we collected our data from ...): 54
SIMILAR (our approach resembles that of): 13
SOLUTION (we solve this problem by ...): 64
TEXTSTRUCTURE (the paper is organized ...): 13
USE (we employ CITE's method ...): 5
COPULA (our goal is to ...): 1
POSSESSION (we have three goals ...): 1
<Paragraph position="20"> The following classes were removed by the gscore test described above, because their distribution was too similar to the overall distribution: FUTURE INTEREST, NEED, ARGUMENTATION, AFFECT in both negative and positive contexts (X in last column of Table 7), and AWARENESS only in positive context (+ in last column). The following classes had too few occurrences in negative context (< 10 occurrences in the whole verb class) and thus the negative context of the class was also removed: BETTER SOLUTION, CONTRAST, PRESENTATION, PROBLEM (− in last column). Again, the removal improved the performance of the Action feature.</Paragraph> <Paragraph position="21"> The algorithm for determining agents and actions relies on finite-state patterns over part-of-speech (POS) tags. Starting from each finite verb, the algorithm collects chains of auxiliaries belonging to the associated finite clause and thus determines the clause's tense and voice. Other finite verbs and commas are assumed to be clause boundaries. Once the semantic verb is found, its stem is looked up in the action lexicon. Negation is determined if one of 32 fixed negation words is present in a six-word window to the right of the finite verb.</Paragraph> <Paragraph position="22"> As our classifier requires one unique value for each classified item for each feature, we had to choose one value for sentences containing more than one finite clause. We return the following values for the agent and action features: the first agent/action pair, if both are nonzero, otherwise the first agent without an action, otherwise the first action without an agent, if available.</Paragraph> <Paragraph position="23"> In order to determine the level of correctness of agent and action recognition, we first had to evaluate manually the error level of the POS tagging of finite verbs, as our algorithm crucially relies on finite verbs. In a random sample of 100 sentences from our development corpus that contain any finite verbs at all (they happened to contain a total of 184 finite verbs), the tagger (which is part of the TTT software) showed a recall of 95% and a precision of 93%.</Paragraph> <Paragraph position="24"> We found that for the 174 correctly determined finite verbs, the heuristics for negation and presence of modal auxiliaries worked without any errors (100% accuracy, eight negated sentences). The correct semantic verb was determined with 96% accuracy; most errors were due to misrecognition of clause boundaries. Action Type lookup was fully correct (100% accuracy), even in the case of phrasal verbs and longer idiomatic expressions (have to is a NEED ACTION; be inspired by is a CONTINUE ACTION). There were seven voice errors, two of which were due to POS-tagging errors (past participle misrecognized). The remaining five voice errors correspond to 98% accuracy. Correctness of Agent Type determination was tested on a random sample of 100 sentences containing at least one agent, resulting in 111 agents. No agent pattern that should have been identified was missed (100% recall). Of the 111 agents, 105 cases were correct (precision of 95%). Therefore, we consider the two features to be adequately robust to serve as sentential features in our system.</Paragraph>
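To illustrate the shape of this algorithm, here is a heavily simplified sketch that works from a single POS-tagged clause; the tag set, the tiny lexicons, the crude stemming, and the negation window are illustrative assumptions, not the authors' finite-state implementation.

ACTION_LEXICON = {"present": "PRESENTATION", "argue": "ARGUMENTATION", "solve": "SOLUTION"}
AGENT_LEXICON = {"we": "US_AGENT", "they": "THEM_AGENT", "researchers": "GENERAL_AGENT"}
NEGATION_WORDS = {"not", "n't", "never", "no"}
MODALS = {"may", "might", "can", "could", "should", "would", "must"}

def analyse_clause(tagged):
    # tagged: list of (token, POS) pairs for one finite clause
    tokens = [t.lower() for t, _ in tagged]
    verbs = [i for i, (_, pos) in enumerate(tagged) if pos.startswith("VB") or pos == "MD"]
    if not verbs:
        return None
    main = verbs[-1]                                   # last verb of the group = semantic verb
    group = {tokens[i] for i in verbs}
    voice = "passive" if tagged[main][1] == "VBN" and group & {"is", "are", "was", "were", "be", "been", "being"} else "active"
    modal = bool(group & MODALS)
    negated = bool(set(tokens[verbs[0]:verbs[0] + 7]) & NEGATION_WORDS)   # six-word window
    action = ACTION_LEXICON.get(tokens[main].rstrip("sd"), None)          # crude stemming
    subject = next((AGENT_LEXICON[t] for t in tokens[:verbs[0]] if t in AGENT_LEXICON), None)
    return {"agent": subject, "action": action, "voice": voice, "modal": modal, "negated": negated}

print(analyse_clause([("we", "PRP"), ("do", "VBP"), ("not", "RB"), ("solve", "VB"),
                      ("this", "DT"), ("problem", "NN")]))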
<Paragraph position="25"> Having detailed the features and classifiers of the machine learning system we use, we will now turn to an intrinsic evaluation of its performance.</Paragraph> </Section> </Section> <Section position="6" start_page="431" end_page="438" type="metho"> <SectionTitle> 5. Intrinsic System Evaluation </SectionTitle> <Paragraph position="0"> Our task is to perform content selection from scientific articles, which we do by classifying sentences into seven rhetorical categories. The summaries based on this classification use some of these sentences directly, namely, sentences that express the contribution of a particular article (AIM), sentences expressing contrasts with other work (CONTRAST), and sentences stating imported solutions from other work (BASIS). Other, more frequent rhetorical categories, namely OTHER, OWN, and BACKGROUND, might also be extracted into the summary.</Paragraph> <Paragraph position="1"> Because the task is a mixture of extraction and classification, we report system performance as follows: * We first report precision and recall values for all categories, in comparison to human performance and the text categorization baseline, as we are primarily interested in good performance on the categories AIM, CONTRAST, BASIS, and BACKGROUND.</Paragraph> <Paragraph position="2"> * We are also interested in good overall classification performance, which we report using kappa and macro-F as our metrics. We also discuss how well each single feature does in the classification.</Paragraph> <Paragraph position="3"> * We then compare the extracted sentences to our human gold standard for relevance and report the agreement in precision and agreement per category.</Paragraph> <Section position="1" start_page="432" end_page="437" type="sub_section"> <SectionTitle> 5.1 Determination of Rhetorical Status </SectionTitle> <Paragraph position="0"> The results of stochastic classification were compiled with a 10-fold cross-validation on our 80-paper corpus. As we do not have much annotated material, cross-validation is a practical way to test as it can make use of the full development corpus for training, without ever using the same data for training and testing.</Paragraph> <Paragraph position="1"> 5.1.1 Overall Results. Table 8 and Figure 11 show that the stochastic model obtains substantial improvement over the baseline in terms of precision and recall of the important categories AIM, BACKGROUND, CONTRAST, and BASIS. We use the F-measure, defined by van Rijsbergen (1979) as F = 2PR/(P + R), as a convenient way of reporting precision (P) and recall (R) in one value.</Paragraph> <Paragraph position="2"> F-measures for our categories range from .61 (TEXTUAL) and .52 (AIM) to .45 (BACKGROUND), .38 (BASIS), and .26 (CONTRAST). The recall for some categories is relatively low. As our gold standard is designed to contain a lot of redundant information for the same category, this is not too worrying. Low precision in some categories (e.g., 34% for CONTRAST, in contrast to human precision of 50%), however, could potentially present a problem for later steps in the document summarization process.</Paragraph>
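For concreteness, per-category precision, recall, and the F-measure quoted above can be computed from gold and predicted label sequences as follows; the label sequences here are toy data, not results from the corpus.

from collections import Counter

def per_category_prf(gold, pred):
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    results = {}
    for cat in set(gold) | set(pred):
        prec = tp[cat] / (tp[cat] + fp[cat]) if tp[cat] + fp[cat] else 0.0
        rec = tp[cat] / (tp[cat] + fn[cat]) if tp[cat] + fn[cat] else 0.0
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0   # F = 2PR/(P+R)
        results[cat] = (prec, rec, f)
    return results

gold = ["OWN", "OWN", "AIM", "OTHER", "AIM", "BASIS"]
pred = ["OWN", "AIM", "AIM", "OTHER", "OWN", "BASIS"]
for cat, (p, r, f) in sorted(per_category_prf(gold, pred).items()):
    print(f"{cat:10s} P={p:.2f} R={r:.2f} F={f:.2f}")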
<Paragraph position="4"> Overall, we find these results encouraging, particularly in view of the subjective nature of the task and the high compression achieved (2% for AIM, BASIS, and TEXTUAL sentences, 5% for CONTRAST sentences, and 6% for BACKGROUND sentences). No direct comparison with Kupiec, Pedersen, and Chen's results is possible as different data sets are used and as Kupiec et al.'s relevant sentences do not directly map into one of our categories. Assuming, however, that their relevant sentences are probably most comparable to our AIM sentences, our precision and recall of 44% and 65% compare favorably to theirs (42% and 42%).</Paragraph> <Paragraph position="5"> Table 9 shows a confusion matrix between one annotator and the system. The system is likely to confuse AIM and OWN sentences (e.g., 100 out of 172 sentences incorrectly classified as AIM by the system turned out to be OWN sentences). It also shows a tendency to confuse OTHER and OWN sentences. The system also fails to distinguish categories involving other people's work (e.g., OTHER, BASIS, and CONTRAST). Overall, these tendencies mirror the errors made by the human annotators. As overall performance measures we report accuracy, kappa, and macro-F (following Lewis [1991]). Macro-F is the mean of the F-measures of all seven categories. One reason for using macro-F and kappa is that we want to measure success particularly on the rare categories that are needed for our final task (i.e., AIM, BASIS, and CONTRAST). Microaveraging techniques like traditional accuracy tend to overestimate the contribution of frequent categories in skewed distributions like ours; this is undesirable, as OWN is the least interesting category for our purposes. This situation has parallels in information retrieval, where precision and recall are used because accuracy overestimates the performance on irrelevant items.</Paragraph> <Paragraph position="6"> In the case of macro-F, each category is treated as one unit, independent of the number of items contained in it. Therefore, the classification success of the individual items in rare categories is given more importance than the classification success of frequent-category items. When looking at the numerical values, however, one should keep in mind that macroaveraging results are in general numerically lower (Yang and Liu 1999). This is because there are fewer training cases for the rare categories, which therefore perform worse with most classifiers.</Paragraph> <Paragraph position="7"> In the case of kappa, classifications that incorrectly favor frequent categories are punished because of a high random agreement. This effect can be shown most easily when the baselines are considered. The most ambitious baseline we use is the output of a text categorization system, as described in section 4. Other possible baselines, which are all easier to beat, include classification by the most frequent category. This baseline turns out to be trivial, as it does not extract any of the rare rhetorical categories in which we are particularly interested, and therefore receives a low kappa value at K = −.12. Possible chance baselines include random annotation with uniform distribution (K = −.10; accuracy of 14%) and random annotation with observed distribution. The latter baseline is built into the definition of kappa (K = 0; accuracy of 48%).</Paragraph>
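A small sketch of the two overall measures follows: kappa with chance agreement computed from the observed (pooled) category distribution, which is why random annotation with the observed distribution scores K = 0, and macro-F as the unweighted mean of per-category F-measures. The exact kappa formulation used in the paper may differ in detail; this is one common variant.

from collections import Counter

def kappa(gold, pred):
    n = len(gold)
    p_obs = sum(g == p for g, p in zip(gold, pred)) / n
    pooled = Counter(gold) + Counter(pred)                    # pooled category distribution
    p_chance = sum((c / (2 * n)) ** 2 for c in pooled.values())
    return (p_obs - p_chance) / (1 - p_chance)

def macro_f(f_by_category):
    # unweighted mean of per-category F-measures (e.g., from the sketch in 5.1.1)
    return sum(f_by_category.values()) / len(f_by_category)

gold = ["OWN", "OWN", "OWN", "AIM", "OTHER", "BASIS"]
pred = ["OWN", "OWN", "AIM", "AIM", "OTHER", "OWN"]
print(round(kappa(gold, pred), 2))
print(round(macro_f({"AIM": .52, "BASIS": .38, "CONTRAST": .26}), 2))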
<Paragraph position="8"> Although our system outperforms an ambitious baseline (macro-F shows that our system performs roughly 20% better than text classification) and also performs much above chance, there is still a big gap in performance between humans and machine.</Paragraph> <Paragraph position="9"> Macro-F shows a 20% difference between our system and human performance. If the system is put into a pool of annotators for the 25 articles for which three-way human judgment exists, agreement drops from K = .71 to K = .59. This is a clear indication that the system's annotation is still distinguishably different from human annotation.</Paragraph> <Paragraph position="10"> The results reported above were obtained with the optimal feature combination (as determined by an exhaustive search in the space of feature combinations). The most distinctive single feature is Location (achieving an agreement of K = .22 against one annotator, if this feature is used as the sole feature), followed by SegAgent (K = .19), Citations (K = .18), Headlines (K = .17), Agent (K = .08), and Formulaic (K = .07). In each case, the unclustered versions of Agent, SegAgent, and Formulaic performed much worse than the clustered versions; they did not improve final results when added into the feature pool.</Paragraph> <Paragraph position="11"> Action performs slightly better at K = −.11 than the baseline by most frequent category, but far worse than random by observed distribution. The following features on their own classify each sentence as OWN (and therefore achieve K = −.12): Relative Location, Paragraphs, TF*IDF, Title, Sentence Length, Modality, Tense, and Voice. History performs very badly on its own at K = −.51; it classifies almost all sentences as BACKGROUND. It does this because the probability of the first sentence's being a BACKGROUND sentence is almost one, and, if no other information is available, it is very likely that another BACKGROUND sentence will follow after a BACKGROUND sentence.</Paragraph> <Paragraph position="12"> Each of these features, however, still contributes to the final result: If any of them is taken out of the feature pool, classification performance decreases. How can this be, given that the individual features perform worse than chance? As the classifier derives the posterior probability by multiplying evidence from each feature, even slight evidence coming from one feature can direct the decision in the right direction. A feature that contributes little evidence on its own (too little to break the prior probability, which is strongly biased toward OWN) can thus, in combination with others, still help in disambiguating. For the naïve Bayesian classification method, indeed, it is most important that the features be as independent of each other as possible. This property cannot be assessed by looking at the feature's isolated performance, but only in combination with others.</Paragraph>
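One feature whose evidence can only be combined with the others at decoding time is History, since the category of the previous sentence is itself a prediction on unseen text. As described in section 4.2, the system therefore uses a beam search of width three rather than a full Viterbi search. A minimal sketch of such a decoder follows; the scoring function is a placeholder standing in for the classifier's log posterior, not the actual model.

CATEGORIES = ["AIM", "CONTRAST", "BASIS", "BACKGROUND", "OTHER", "OWN", "TEXTUAL"]

def score(features, category):
    # placeholder standing in for the classifier's log posterior for this sentence
    bonus = 0.5 if features.get("History") == category else 0.0
    return bonus - 0.01 * len(category)

def beam_decode(sentence_features, beam_width=3):
    beams = [([], 0.0)]                                  # (category sequence, cumulative score)
    for feats in sentence_features:
        candidates = []
        for seq, total in beams:
            h_feats = dict(feats, History=seq[-1] if seq else "START")
            for cat in CATEGORIES:
                candidates.append((seq + [cat], total + score(h_feats, cat)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]

doc_features = [{"Location": "A"}, {"Location": "A"}, {"Location": "F"}]
print(beam_decode(doc_features))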
<Paragraph position="13"> It is also interesting to see that certain categories are disambiguated particularly well by certain features (cf. Table 11). The Formulaic feature, which is by no means the strongest feature, is nevertheless the most diverse, as it contributes to the disambiguation of six categories directly. This is because many different rhetorical categories have typical cue phrases associated with them (whereas not all categories have a preferred location in the document). Not surprisingly, Location and History are the features particularly useful for detecting BACKGROUND sentences, and SegAgent additionally contributes toward the determination of BACKGROUND zones (along with the Formulaic and the Absolute Location features). The Agent and Action features also prove their worth as they manage to disambiguate categories that many of the other features alone cannot disambiguate (e.g., CONTRAST).</Paragraph> <Paragraph position="14"> To give an impression of how the figures reported in the previous section translate into real output, we present in figure 12 the output of the system when run on the example paper (all AIM, CONTRAST, and BASIS sentences). The second column shows whether the human judge agrees with the system's decision (a tick for correct decisions, and the human's preferred category for incorrect decisions). Ten out of the 15 extracted sentences have been classified correctly.</Paragraph>
Figure 12 System output for example paper (✓ = system decision judged correct; for incorrect decisions the human's preferred category is given in parentheses; * = sentence also judged relevant by the human annotator):
... evidence that they tend to participate in the same events.
✓ *10 Our research addresses some of the same questions and uses similar raw data, but we investigate how to factor word association tendencies into associations of words to certain hidden senses classes and associations between the classes themselves.
✓ 11 While it may be worthwhile to base such a model on preexisting sense classes (Resnik, 1992), in the work described here we look at how to derive the classes directly from distributional data.
(OWN) 12 More specifically, we model senses as probabilistic concepts or clusters c with corresponding cluster membership probabilities EQN for each word w.
✓ *22 We will consider here only the problem of classifying nouns according to their distribution as direct objects of verbs; the converse problem is formally similar.
(CTR) 41 However, this is not very satisfactory because one of the goals of our work is precisely to avoid the problems of data sparseness by grouping words into classes.
(OWN) 150 We also evaluated asymmetric cluster models on a verb decision task closer to possible applications to disambiguation in language analysis.
✓ *162 We have demonstrated that a general divisive clustering procedure for probability distributions can be used to group words according to their participation in particular grammatical relations with other words.
... statistical part-of-speech tagger (Church, 1988) and of tools for regular expression pattern matching on tagged corpora (Yarowsky, 1992).
✓ *113 The analogy with statistical mechanics suggests a deterministic annealing procedure for clustering (Rose et al., 1990), in which the number of clusters is determined through a sequence of phase transitions by continuously increasing the parameter EQN following an annealing schedule.
CTR:
✓ *9 His notion of similarity seems to agree with our intuitions in many cases, but it is not clear how it can be used directly to construct word classes and corresponding models of association.
✓ *14 Class construction is then combinatorially very demanding and depends on frequency counts for joint events involving particular words, a potentially unreliable source of information as we noted above.
(OWN) 21 We have not yet compared the accuracy and coverage of the two methods, or what systematic biases they might introduce, although we took care to filter out certain systematic errors, for instance the misparsing of the subject of a complement clause as the direct object of a main verb for report verbs like &quot;say&quot;.
✓ 43 This is a useful advantage of our method compared with agglomerative clustering techniques that need to compare individual objects being considered for grouping.
<Paragraph position="15"> The example also shows that the determination of rhetorical status is not always straightforward. For example, whereas the first AIM sentence that the system proposes (sentence 8) is clearly wrong, all other &quot;incorrect&quot; AIM sentences carry important information about research goals of the paper: Sentence 41 states the goal in explicit terms, but it also contains a contrastive statement, which the annotator decided to rate higher than the goal statement. Both sentences 12 and 150 give high-level descriptions of the work that might pass as a goal statement. Similarly, in sentence 21 the agent and action features detected that the first part of the sentence has something to do with comparing methods, and the system then (plausibly but incorrectly) decided to classify the sentence as CONTRAST. All in all, we feel that the extracted material conveys the rhetorical status adequately. An extrinsic evaluation additionally showed that the end result provides considerable added value when compared to sentence extracts (Teufel 2001).</Paragraph> </Section> <Section position="2" start_page="437" end_page="437" type="sub_section"> <SectionTitle> 5.2 Relevance Determination </SectionTitle> <Paragraph position="0"> The classifier for rhetorical status that we evaluated in the previous section is an important first step in our implementation; the next step is the determination of relevant sentences in the text. One simple solution for relevance decision would be to use all AIM, BASIS, and CONTRAST sentences, as these categories are rare overall. The classifier we use has the nice property of roughly keeping the distribution of target categories, so that we end up with a sensible number of these sentences.</Paragraph> <Paragraph position="1"> The strategy of using all AIM, CONTRAST, and BASIS sentences can be evaluated in a similar vein to the previous experiment. In terms of relevance, the asterisk in figure 12 marks sentences that the human judge found particularly relevant in the overall context (cf. the full set in figure 5). Six out of all 15 sentences, and 6 out of the 10 sentences that received the correct rhetorical status, were judged relevant in the example.</Paragraph> <Paragraph position="2"> Table 12 reports the figures for the entire corpus by comparing the system's output of correctly classified rhetorical categories to human judgment. In all cases, the results are far above the nontrivial baseline. On AIM, CONTRAST, and BASIS sentences, our system achieves very high precision values of 96%, 70%, and 71%. Recall is lower at 70%, 24%, and 39%, but low recall is less of a problem in our final task. Therefore, the main bottleneck is correct rhetorical classification. Once that is accomplished, the selected categories show high agreement with human judgment and should therefore represent good material for further processing steps.</Paragraph>
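A sketch of the simple selection strategy evaluated in Table 12: extract every sentence classified AIM, CONTRAST, or BASIS and compare the extract against the human relevance judgments. The data structures and toy labels below are illustrative, not corpus data.

EXTRACT_CATEGORIES = {"AIM", "CONTRAST", "BASIS"}

def select_by_category(predicted_labels):
    # indices of all sentences whose predicted rhetorical category is extract-worthy
    return [i for i, cat in enumerate(predicted_labels) if cat in EXTRACT_CATEGORIES]

def precision_recall(selected, gold_relevant):
    selected, gold_relevant = set(selected), set(gold_relevant)
    tp = len(selected & gold_relevant)
    prec = tp / len(selected) if selected else 0.0
    rec = tp / len(gold_relevant) if gold_relevant else 0.0
    return prec, rec

predicted = ["OWN", "AIM", "OWN", "CONTRAST", "OTHER", "BASIS", "OWN"]
gold_relevant = [1, 3, 4]          # sentence indices the human judge marked relevant
chosen = select_by_category(predicted)
print(chosen, precision_recall(chosen, gold_relevant))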
<Paragraph position="3"> If, however, one is also interested in selecting BACKGROUND sentences, as we are, simply choosing all BACKGROUND sentences would result in low precision of 16% (albeit with a high recall of 83%), which does not seem to be the optimal solution.</Paragraph> <Paragraph position="4"> We therefore use a second classifier, trained on the relevance gold standard, for finding the most relevant sentences independently. Our best classifier operates at a precision of 46.5% and recall of 45.2% (using the features Location, Section Struct, Paragraph Struct, Title, TF*IDF, Formulaic, and Citation for classification). The second classifier (cf. rightmost columns in figure 12) raises the precision for BACKGROUND sentences from 16% to 38%, while keeping recall high at 88%.</Paragraph> <Paragraph position="5"> This example shows that the right procedure for relevance determination changes from category to category and also depends on the final task one is trying to accomplish.</Paragraph> </Section> </Section> </Paper>