<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1014">
<Title>Towards Automatic Classification of Discourse Elements in Essays</Title>
<Section position="3" start_page="1" end_page="1" type="metho">
<SectionTitle> 2 What Are Thesis Statements? </SectionTitle>
<Paragraph position="0"> A thesis statement is defined as the sentence that explicitly identifies the purpose of the paper or previews its main ideas (see footnote 1). This definition seems straightforward enough, and would lead one to believe that identifying the thesis statement in an essay is a clear-cut task, even for people. However, the essay in Figure 1 is a common example of the kind of first-draft writing that our system has to handle. Figure 1 shows a student response to the essay question: Often in life we experience a conflict in choosing between something we "want" to do and something we feel we "should" do. In your opinion, are there any circumstances in which it is better for people to do what they "want" to do rather than what they feel they "should" do? Support your position with evidence from your own experience or your observations of other people.</Paragraph>
<Paragraph position="1"> The writing in Figure 1 illustrates one kind of challenge in the automatic identification of discourse elements, such as thesis statements. In this case, the two human annotators independently chose different text as the thesis statement (the two texts highlighted in bold and italics in Figure 1). In this kind of first-draft writing, it is not uncommon for writers to repeat ideas or express more than one general opinion about the topic, resulting in text that seems to contain multiple thesis statements.</Paragraph>
<Paragraph position="2"> Before building a system that automatically identifies thesis statements in essays, we wanted to determine whether the task was well-defined. In collaboration with two writing experts, a simple discourse-based annotation protocol was developed to manually annotate discourse elements in essays for a single essay topic.</Paragraph>
<Paragraph position="3"> This was the initial attempt to annotate essay data using discourse elements generally associated with essay structure, such as the thesis statement, the concluding statement, and the topic sentences of the essay's main ideas. The writing experts defined the characteristics of the discourse labels. These experts then annotated 100 essay responses to one English Proficiency Test (EPT) question, called Topic B, using a PC-based interface implemented in Java.</Paragraph>
<Paragraph position="4"> We computed the agreement between the two human annotators using the kappa coefficient (Siegel and Castellan, 1988), a statistic used extensively in previous empirical studies of discourse. The kappa statistic measures pairwise agreement among a set of coders who make categorical judgments, correcting for expected chance agreement.</Paragraph>
<Paragraph position="5"> The kappa agreement between the two annotators with respect to the thesis statement labels was 0.733 (N=2391, where 2391 is the total number of sentences across all annotated essay responses). This indicates good agreement by the standards of content analysis research (Krippendorff, 1980), which suggests that kappa values higher than 0.8 reflect very high agreement and values higher than 0.6 reflect good agreement.
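As a concrete reference for how this agreement figure is computed, the following is a minimal sketch (not the authors' code) of the kappa calculation for two annotators' sentence-level labels, with the chance correction described by Siegel and Castellan (1988); the label values and variable names are illustrative.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """labels_a, labels_b: parallel lists of categorical labels, one per sentence."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's marginal label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n)
                   for c in set(labels_a) | set(labels_b))
    return (observed - expected) / (1.0 - expected)

# Hypothetical usage over the N = 2391 sentences, labeled "thesis" or "other":
# kappa = cohens_kappa(annotator1_labels, annotator2_labels)   # paper reports 0.733
```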
The corresponding z statistic was 27.1, far above the critical z value of 2.32 for the 0.01 significance level (Siegel and Castellan, 1988), indicating that the observed agreement is highly unlikely to be due to chance.</Paragraph>
<Paragraph position="6"> In the early stages of our project, it was suggested to us that thesis statements reflect the most important sentences in essays. In terms of summarization, these sentences would represent indicative, generic summaries (Mani and Maybury, 1999; Marcu, 2000). To test this hypothesis (and to estimate the adequacy of using summarization technology for identifying thesis statements), we carried out an additional experiment. The same annotation tool was used with two different human judges, who were asked this time to identify the most important sentence of each essay. The agreement between human judges on the task of identifying summary sentences was significantly lower: the kappa was 0.603 (N=2391). Tables 1a and 1b summarize the results of the annotation experiments.</Paragraph>
<Paragraph position="7"> Table 1a shows the degree of agreement between human judges on the task of identifying thesis statements and generic summary sentences. The agreement figures are given using the kappa statistic and the relative precision (P), recall (R), and F-values (F), which reflect the ability of one judge to identify the sentences labeled as thesis statements or summary sentences by the other judge. The results in Table 1a show that the task of thesis statement identification is much better defined than the task of identifying important summary sentences. In addition, Table 1b indicates that there is very little overlap between thesis and generic summary sentences: just 6% of the summary sentences were labeled by human judges as thesis statement sentences. This strongly suggests that there are critical differences between thesis statements and summary sentences, at least in first-draft essay writing. It is possible that thesis statements reflect an intentional facet (Grosz and Sidner, 1986) of language, while summary sentences reflect a semantic one (Martin, 1992).</Paragraph>
<Paragraph position="8"> More detailed experiments need to be carried out, however, before firm conclusions can be drawn.</Paragraph>
<Paragraph position="9"> The results in Table 1a provide an estimate of an upper bound for a thesis statement identification algorithm. If one can build an automatic classifier that identifies thesis statements at recall and precision levels as high as 70%, the performance of such a classifier will be indistinguishable from the performance of humans.</Paragraph>
</Section>
<Section position="4" start_page="1" end_page="1" type="metho">
<SectionTitle> 3 A Bayesian Classifier for Identifying Thesis Statements </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="1" end_page="1" type="sub_section">
<SectionTitle> 3.1 Description of the Approach </SectionTitle>
<Paragraph position="0"> We initially built a Bayesian classifier for thesis statements using essay responses to one test question. McCallum and Nigam (1998) compare two probabilistic models for text classification that can be used to train Bayesian independence classifiers. They describe the multinomial model as the more traditional approach for statistical language modeling (especially in speech recognition applications), where a document is represented by a set of word occurrences, and where probability estimates reflect the number of word occurrences in a document.
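Before turning to the Bernoulli variant described next, the contrast between the two document representations can be made concrete with a minimal illustration (not from the paper; the sentence and vocabulary are made up):

```python
# Minimal illustration of the two document representations:
# multinomial = word counts; multivariate Bernoulli = binary presence/absence.
from collections import Counter

sentence = "people should do what they want to do".split()
vocabulary = {"people", "should", "want", "do", "opinion"}

multinomial_rep = Counter(w for w in sentence if w in vocabulary)
# Counter({'do': 2, 'people': 1, 'should': 1, 'want': 1})

bernoulli_rep = {w: w in sentence for w in vocabulary}
# e.g. {'people': True, 'should': True, 'want': True, 'do': True, 'opinion': False}
```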
In the alternative multivariate Bernoulli model, a document is represented by both the absence and the presence of features. On a text classification task, McCallum and Nigam (1998) show that the multivariate Bernoulli model performs well with small vocabularies, as opposed to the multinomial model, which performs better when larger vocabularies are involved.</Paragraph>
<Paragraph position="1"> Larkey (1998) uses the multivariate Bernoulli approach for an essay scoring task, and her results are consistent with the results of McCallum and Nigam (1998) (see also Larkey and Croft (1996) for descriptions of additional applications). In Larkey (1998), the sets of essays used for training scoring models typically contain fewer than 300 documents.</Paragraph>
<Paragraph position="2"> Furthermore, the vocabulary used across these documents tends to be restricted.</Paragraph>
<Paragraph position="3"> Based on the success of Larkey's experiments, and on McCallum and Nigam's finding that the multivariate Bernoulli model performs better on texts with small vocabularies, this approach seemed the likely choice for data sets of essay responses. Therefore, we adopted this approach to build a thesis statement classifier that can select from an essay the sentence that is the most likely candidate to be labeled as the thesis statement.</Paragraph>
<Paragraph position="4"> In our research, we also trained classifiers using a classical Bayes approach, in which two classifiers were built: a thesis classifier and a non-thesis classifier. In the classical Bayes implementation, each classifier was trained only on positive feature evidence, in contrast to the multivariate Bernoulli approach, which trains classifiers on both the absence and the presence of features. Since the performance of the classical Bayes classifiers was lower than the performance of the Bernoulli classifier, we report here only the performance of the latter.</Paragraph>
<Paragraph position="5"> In our experiments, we used three general feature types to build the classifier: sentence position; words commonly occurring in thesis statements; and RST labels from outputs generated by an existing rhetorical structure parser (Marcu, 2000).</Paragraph>
<Paragraph position="6"> We trained the classifier to predict thesis statements in an essay. The multivariate Bernoulli formula below gives us the log probability that a sentence S in an essay belongs to the class T of sentences that are thesis statements. We found that it helped performance to use a Laplace estimator to deal with cases where the probability estimates were equal to zero.</Paragraph>
<Paragraph position="7">
\[
\log P(T \mid S) = \log P(T) + \sum_i
\begin{cases}
\log \dfrac{P(A_i \mid T)}{P(A_i)} & \text{if } S \text{ contains } A_i \\
\log \dfrac{P(\overline{A}_i \mid T)}{P(\overline{A}_i)} & \text{if } S \text{ does not contain } A_i
\end{cases}
\]
In this formula, P(T) is the prior probability that a sentence is in class T, P(A_i | T) is the conditional probability of a sentence having feature A_i given that the sentence is in T, P(A_i) is the prior probability that a sentence contains feature A_i, P(\overline{A}_i | T) is the conditional probability that a sentence does not contain feature A_i given that it is in T, and P(\overline{A}_i) is the prior probability that a sentence does not contain feature A_i.</Paragraph>
</Section>
<Section position="2" start_page="1" end_page="1" type="sub_section">
<SectionTitle> 3.2 Features Used to Classify Thesis Statements 3.2.1 Positional Feature </SectionTitle>
<Paragraph position="0"> We found that the likelihood of a thesis statement occurring at the beginning of an essay was quite high in the human-annotated data. To account for this, we used one feature that reflected the position of each sentence in an essay.</Paragraph>
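The scoring above can be read as the following minimal sketch (not the authors' implementation; the feature names and data layout are illustrative), which treats each sentence as a set of binary features and applies the Laplace estimator mentioned in the text:

```python
import math

def train(sentences, labels):
    """sentences: list of feature sets, e.g. {"pos=first", "word=opinion", "rst_rel=Contrast"};
    labels: parallel list of booleans, True for thesis sentences."""
    vocab = set().union(*sentences)
    n = len(sentences)
    n_thesis = sum(labels)
    prior_t = n_thesis / n
    p_given_t, p_prior = {}, {}
    for f in vocab:
        in_thesis = sum(1 for s, l in zip(sentences, labels) if l and f in s)
        in_any = sum(1 for s in sentences if f in s)
        # Laplace (add-one) estimator avoids zero probabilities.
        p_given_t[f] = (in_thesis + 1) / (n_thesis + 2)
        p_prior[f] = (in_any + 1) / (n + 2)
    return vocab, prior_t, p_given_t, p_prior

def log_score(sentence_features, model):
    """Log probability (up to a constant) that the sentence is a thesis statement."""
    vocab, prior_t, p_given_t, p_prior = model
    score = math.log(prior_t)
    for f in vocab:
        if f in sentence_features:
            score += math.log(p_given_t[f] / p_prior[f])
        else:
            score += math.log((1 - p_given_t[f]) / (1 - p_prior[f]))
    return score

# The sentence with the highest score in an essay is selected as its thesis statement:
# thesis = max(essay_sentences, key=lambda s: log_score(s, model))
```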
<Paragraph position="1"> All words from the human-annotated thesis statements were used to build the Bayesian classifier. We will refer to these words as the thesis word list. From the training data, a vocabulary list was created that included one occurrence of each word used in all resolved human annotations of thesis statements. All words in this list were used as independent lexical features. We found that the use of various lists of stop words decreased the performance of our classifier, so we did not use them.</Paragraph>
</Section>
<Section position="3" start_page="1" end_page="1" type="sub_section">
<SectionTitle> Rhetorical Structure Theory (RST) Features </SectionTitle>
<Paragraph position="0"> According to RST (Mann and Thompson, 1988), one can associate a rhetorical structure tree with any text. The leaves of the tree correspond to elementary discourse units and the internal nodes correspond to contiguous text spans. Each node in a tree is characterized by a status (nucleus or satellite) and a rhetorical relation, which is a relation that holds between two non-overlapping text spans. The distinction between nuclei and satellites comes from the empirical observation that the nucleus expresses what is more essential to the writer's intention than the satellite, and that the nucleus of a rhetorical relation is comprehensible independently of the satellite, but not vice versa. When spans are equally important, the relation is multinuclear. Rhetorical relations reflect semantic, intentional, and textual relations that hold between text spans, as illustrated in Figure 2. For example, one text span may elaborate on another text span; the information in two text spans may be in contrast; and the information in one text span may provide background for the information presented in another text span.</Paragraph>
<Paragraph position="1"> Figure 2 displays, in the style of Mann and Thompson (1988), the rhetorical structure tree of a text fragment. In Figure 2, nuclei are represented using straight lines and satellites using arcs. Internal nodes are labeled with rhetorical relation names.</Paragraph>
<Paragraph position="2"> We built RST trees automatically for each essay using the cue-phrase-based discourse parser of Marcu (2000). We then associated with each sentence in an essay a feature that reflected the status of its parent node (nucleus or satellite) and another feature that reflected its rhetorical relation. For example, for the last sentence in Figure 2 we associated the status satellite and the relation elaboration, because that sentence is the satellite of an elaboration relation. For sentence 2, we associated the status nucleus and the relation elaboration, because that sentence is the nucleus of an elaboration relation.</Paragraph>
<Paragraph position="3"> We found that some rhetorical relations occurred more frequently in sentences annotated as thesis statements. Therefore, the conditional probabilities for such relations were higher and provided evidence that certain sentences were thesis statements. The Contrast relation shown in Figure 2, for example, was a rhetorical relation that occurred more often in thesis statements.</Paragraph>
<Paragraph position="4"> Arguably, there may be some overlap between the words in thesis statements and the rhetorical relations used to build the classifier. The RST relations, however, capture long-distance relations between text spans, which are not accounted for by the words in our thesis word list.</Paragraph>
</Section>
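As a sketch of how such features might be attached to sentences, the fragment below assumes a hypothetical tree interface; the actual output format of the Marcu (2000) parser is not shown in the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RstNode:
    status: str                          # "nucleus" or "satellite"
    relation: str                        # e.g. "Elaboration", "Contrast", "Background"
    sentence_id: Optional[int] = None    # set on the node spanning a single sentence
    children: List["RstNode"] = field(default_factory=list)

def rst_features(root: RstNode) -> dict:
    """Map sentence_id -> binary RST features (the node's status and its relation)."""
    features = {}
    def walk(node: RstNode) -> None:
        if node.sentence_id is not None:
            features[node.sentence_id] = {
                f"rst_status={node.status}",
                f"rst_rel={node.relation}",
            }
        for child in node.children:
            walk(child)
    walk(root)
    return features
```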
<Section position="4" start_page="1" end_page="1" type="sub_section">
<SectionTitle> 3.3 Evaluation of the Bayesian Classifier </SectionTitle>
<Paragraph position="0"> We estimated the performance of our system using a six-fold cross-validation procedure. We partitioned into six groups the 93 essays in which both human annotators labeled a thesis statement. (The judges agreed that 7 of the 100 essays they annotated had no thesis statement.) We trained six times on 5/6 of the labeled data and evaluated the performance on the other 1/6 of the data.</Paragraph>
<Paragraph position="1"> The evaluation results in Table 2 show the average performance of our classifier with respect to the resolved annotation (Alg. wrt. Resolved), using traditional recall (R), precision (P), and F-value (F) metrics. For purposes of comparison, Table 2 also shows the performance of two baselines: the random baseline selects a sentence at random as the thesis statement, while the position baseline assumes that the thesis statement is given by the first sentence in each essay.</Paragraph>
<Paragraph position="2"> In commercial settings, it is crucial that a classifier such as the one discussed in Section 3 generalizes across different test questions. New test questions are introduced on a regular basis, so it is important that a classifier that works well for a given data set also works well for other data sets, without requiring additional annotation and training.</Paragraph>
<Paragraph position="3"> For the thesis statement classifier, it was important to determine whether the positional, lexical, and RST-specific features are topic independent, and thus generalizable to new test questions. If so, this would indicate that we could annotate thesis statements for a number of topics and re-use the algorithm on additional topics without further annotation. We asked a writing expert to manually annotate the thesis statement in approximately 45 essays for 4 additional test questions: Topics A, C, D, and E. The annotator completed this task using the same interface that was used by the two annotators in Experiment 1.</Paragraph>
<Paragraph position="4"> To test generalizability across the five EPT questions, the thesis sentences selected by a writing expert were used to build the classifiers. Five classifiers were built, each trained on a different combination of 4 prompts, and each resulting classifier was then cross-validated on the fifth topic, which was treated as test data. To evaluate the performance of each classifier, agreement was calculated for each 'cross-validation' sample (a single topic) by comparing the algorithm's selections to our writing expert's thesis statement selections. For example, we trained on Topics A, C, D, and E, using the thesis statements selected manually.</Paragraph>
<Paragraph position="5"> This classifier was then used to select, automatically, thesis statements for Topic B. In the evaluation, the algorithm's selections were compared to the manually selected set of thesis statements for Topic B, and agreement was calculated. Table 3 illustrates that in all but one case, agreement exceeds both baselines from Table 2. In this set of manual annotations, the human judge almost always selected one sentence as the thesis statement. This is why Precision, Recall, and the F-value are often equal in Table 3.</Paragraph>
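The cross-topic evaluation loop described here can be summarized in the following sketch (not the authors' code; `train_fn` and `score_fn` are hypothetical stand-ins for the training and scoring routines sketched earlier):

```python
def evaluate_across_topics(essays_by_topic, train_fn, score_fn):
    """essays_by_topic: {topic: [(sentence_feature_sets, gold_thesis_index), ...]}"""
    results = {}
    for held_out in essays_by_topic:
        training = [essay
                    for topic, essays in essays_by_topic.items() if topic != held_out
                    for essay in essays]
        model = train_fn(training)
        test = essays_by_topic[held_out]
        correct = 0
        for sentences, gold_index in test:
            predicted = max(range(len(sentences)),
                            key=lambda i: score_fn(sentences[i], model))
            correct += int(predicted == gold_index)
        # With one predicted and one gold thesis sentence per essay, precision,
        # recall, and F-value all reduce to this proportion, which is why they
        # often coincide in Table 3.
        results[held_out] = correct / len(test)
    return results
```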
</Section>
</Section>
</Paper>