Evaluating Multiple Aspects of Coherence in Student Essays (N04-1024)

1 Introduction

The Criterion(SM) Online Essay Evaluation Service is an application for writing instruction that includes a capability to annotate sentences in student essays with discourse element labels. These labels include the categories Thesis Statement, Main Idea, Supporting Idea, and Conclusion (Burstein et al., 2003b). Though it accurately annotates sentences with essay-based discourse labels, Criterion does not evaluate the expressive quality of the sentences that comprise a discourse segment. The system might accurately label a student's essay as having all of the typically expected discourse elements: a thesis statement, three main ideas, supporting evidence linked to each main idea, and a conclusion. As teachers have pointed out, however, an essay may have all of these organizational elements while the quality of the individual elements still needs improvement.

In this paper, we present a capability that captures the expressive quality of sentences in the discourse segments of an essay. For this work, we have defined expressive quality in terms of four aspects of global and local essay coherence. The first two dimensions capture global coherence, and the latter two relate to local coherence: a) relatedness to the essay question (topic), b) relatedness between discourse elements, c) intra-sentential quality, and d) sentence relatedness within a discourse segment. Each dimension represents a different aspect of coherence.

Essentially, the goal of the system is to predict whether a sentence in a discourse segment has high or low expressive quality with regard to a particular coherence dimension. We have deliberately developed an approach to essay coherence that comprises multiple dimensions, so that an instructional application can provide appropriate feedback to student writers based on the system's prediction of high or low for each dimension. For instance, the sentences in a student's thesis statement may have a strong relationship to the essay topic, but may contain a number of serious grammatical errors that make them hard to follow. For this student, we may want to point out that, on the one hand, the sentences in the thesis address the topic, but the thesis statement as a discourse segment might be more clearly stated if the grammar errors were fixed. By contrast, the sentences that comprise a student's thesis statement may be grammatically correct, but only loosely related to the essay topic. For this student, we would also want the system to provide appropriate feedback, so that the student could revise the thesis statement text accordingly.

In earlier work, Foltz, Kintsch and Landauer (1998) and Wiemer-Hastings and Graesser (2000) developed systems that also examine coherence in student writing. Their systems measure lexical relatedness between text segments by using vector-based similarity between adjacent sentences. This linear approach to similarity scoring is in line with the TextTiling scheme (Hearst and Plaunt, 1993; Hearst, 1997), which may be used to identify the subtopic structure of a text.
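For concreteness, the sketch below illustrates this linear, adjacent-sentence similarity scheme using simple bag-of-words cosine similarity. It is a minimal Python illustration of the general idea, assuming whitespace tokenization; the cited systems actually used LSA vectors rather than raw word counts.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def adjacent_similarities(sentences: list[str]) -> list[float]:
    """Score each sentence against the one before it, the 'linear' view
    of coherence taken by the earlier systems discussed above."""
    vectors = [Counter(s.lower().split()) for s in sentences]
    return [cosine(vectors[i - 1], vectors[i]) for i in range(1, len(vectors))]
```

Under this view, a low similarity between two adjacent sentences signals a potential coherence break at that point; no information about the essay's discourse structure is used.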
Miltsakaki and Kukich (2000) have also addressed the issue of establishing the coherence of student essays, using the Rough Shift element of Centering Theory. Again, this previous work looks at the relatedness of adjacent text segments, and does not explore global aspects of text coherence.

Hierarchical models of discourse have been applied to the question of coherence (Mann and Thompson, 1986), but so far these have been more useful in language generation than in determining how coherent a given text is, or in identifying specific problems, such as where coherence breaks down in a document.

Our approach differs in fundamental ways from this earlier work on student writing. First, Foltz et al. (1998), Wiemer-Hastings and Graesser (2000), and Miltsakaki and Kukich (2000) assume that text coherence is linear: they calculate the similarity between adjacent segments of text. By contrast, our approach considers the discourse structure of the text, following Burstein et al. (2003b). Our method considers sentences with regard to their discourse segments, and how the sentences relate to other text segments both inside the document (such as the essay thesis) and outside it (such as the essay topic). This allows us to identify cases in which there may be a breakdown in coherence due to more global aspects of essay-based discourse structure. Second, previous work has used Latent Semantic Analysis as a semantic similarity measure (Landauer and Dumais, 1997), whereas we have adapted another vector-based method of semantic representation: Random Indexing (Kanerva et al., 2000; Sahlgren, 2001). A further difference is that we train our system on essays manually annotated along the four coherence dimensions.

The final system employs a hybrid approach to classify the first two of the four coherence dimensions with a high or low quality rank. For these dimensions, a support vector machine is used to model features derived from Random Indexing and from essay-based discourse structure information. A third, local coherence dimension is handled by rule-based heuristics. The fourth dimension, which concerns coherence within a discourse segment, cannot be classified due to a lack of data characterizing low expressive quality; this is explained fully later in the paper.

2 Protocol Development and Human Annotation

2.1 Protocol Development

The development of this system required a corpus of human-annotated essay data for modeling purposes. Ultimately, the goal is for the system to make judgments similar to those a human would make when ranking the coherence of an essay on the four dimensions. We therefore created a detailed protocol for annotating the expressive quality of essay-based discourse elements with regard to four aspects of global and local essay coherence. This protocol was designed for the following purposes:
1. To yield annotations that are useful for providing students with feedback about the expressive relatedness of discourse elements in their essays, given the four relatedness dimensions;

2. To permit human annotators to achieve high levels of consistency during the annotation process;

3. To produce annotations that have the potential of being derivable by computer programs through training on corpora annotated by humans.

2.1.1 Expressive Quality of Discourse Segments: Protocol Description

According to the writing experts who collaborated in this work, the expressive quality of a sentence in a discourse element may be characterized in terms of four dimensions: a) relationship to the prompt (essay question topic), b) relationship to other discourse elements, c) relatedness within a discourse segment, and d) errors in grammar, usage, and mechanics. For the sake of brevity, we refer to these four dimensions as DimP (relatedness to prompt), DimT (typically, relatedness to thesis), DimS (relatedness within a discourse segment), and DimERR.

The two annotators were required to label each sentence of an essay for expressive quality on the four dimensions above. For the 989 essays used in this study, each sentence had already been manually annotated with one of these discourse labels: background material, thesis, main idea, supporting idea, and conclusion (Burstein et al., 2003b).[1] An assignment of high (1) or low (0) was given to each sentence on the dimensions relevant to its discourse element; not all dimensions apply to all discourse elements. The protocol is extremely specific as to how annotators should label the expressive quality of each sentence in a discourse element with regard to the four dimensions. In this paper, we provide a brief description of the labeling protocol, so that the purpose of each dimension is clear.
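As a concrete picture of the labeling scheme, the sketch below shows one plausible representation of a single annotated sentence. The class and field names are hypothetical, not taken from the annotation tooling described in the paper; None marks dimensions that do not apply to a given discourse element.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AnnotatedSentence:
    """One sentence with its discourse label and coherence annotations.
    Hypothetical structure for illustration; 1 = high, 0 = low, None = N/A."""
    text: str
    discourse_label: str           # "background", "thesis", "main_idea",
                                   # "supporting_idea", or "conclusion"
    dim_p: Optional[int] = None    # relatedness to prompt
    dim_t: Optional[int] = None    # relatedness to thesis (typically)
    dim_s: Optional[int] = None    # relatedness within discourse segment
    dim_err: Optional[int] = None  # grammar/usage/mechanics quality
```

A thesis sentence from Figure 1 below would then carry a DimP value but no DimT value, since DimT is not applicable to the thesis itself.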
Figure 1 shows a sample essay and prompt. A human judge has assigned a label to each sentence in the essay, resulting in the illustrated division into discourse segments. In addition, the figure indicates the human annotators' ratings for two of our coherence dimensions (DimP and DimT, discussed below). By and large, the essay consistently follows up on the ideas of the essay thesis, and so most sentences get a high relatedness score on DimT. However, much of the essay fails to directly address the question posed in the essay prompt, and so many sentences are assigned low relatedness on DimP.

For DimP, the text of the discourse element and the prompt (the text of the essay question) must be related. Specifically, the thesis statement, main ideas, and conclusion statement should all contain text that is strongly related to the essay topic. If this relationship does not exist, this is perhaps evidence that the student has written an off-topic essay. For this dimension, a high rank is assigned to each sentence from the background material, thesis, main ideas, and conclusion statement that is related to the prompt text; otherwise, a low rank is assigned.

Figure 1: A sample essay, showing each sentence's discourse segment and its DimP and DimT ratings (spelling and grammar errors are the student's own).

  Prompt: "Images of beauty-both male and female-are promoted in magazines, in movies, on billboards, and on television. Explain the extent to which you think these images can be beneficial or harmful."

  Background [DimP: Low, DimT: High]: "A lot of people really care about how they look or how other people look."
  Background [DimP: High, DimT: High]: "A lot of people like reading magazines or watch t.v about how you can fix your looks if you don't like the way your looks are."
  Thesis [DimP: Low, DimT: N/A]: "People that care about how they look is because they have problems at home, their parents don't pay attention to them or even that they have a high self-steem which that is not good."
  Thesis [DimP: Low, DimT: N/A]: "A lot of people get to the extent of killing themselfs just because they're not happy with there looks."
  Support [DimP: N/A, DimT: High]: "Many people go thru make-overs to experiment how they will look but, some people still don't like themself."
  Main Point [DimP: Low, DimT: Low]: "The people that don't like themselfs need some helps and they probably feel like that because they have told them oh! your ugly, you look like Blank! or maybe a guy never ask a her out."
  Support [DimP: N/A, DimT: High]: "In case of a guy probably the same comments but he won't dare to ask a girl out because he feels that the girl is going to say no because of the way he looks."
  Support [DimP: N/A, DimT: High]: "Things like this make people don't like each other."
  Conclusion [DimP: Low, DimT: High]: "I suggest that a those people out here that are not happy with their looks get some help."
  Conclusion [DimP: Low, DimT: Low]: "Theirs alot of programs that you can get help."

For DimT, the relationship between a discourse element and the other discourse elements in the text governs the global coherence of the essay. For a text to hold together, certain discourse elements must be related, or the text will appear choppy and will be difficult to follow. Specifically, a high rank is assigned to each sentence in the background material, main ideas, and conclusion that is related to the thesis, and to each supporting idea sentence that relates to the relevant main idea. A conclusion sentence may also be given a high rank if it is related to a main idea or to background information. Low ranks are assigned to sentences that do not have these relationships.

DimS indicates the cohesiveness of the multiple sentences in a discourse segment of a text; it distinguishes text that goes off task within a discourse segment. For this dimension, a high rank was assigned to each sentence in a discourse segment that related to at least one other sentence in the segment; otherwise, the sentence received a low rank. If a discourse segment contained only one sentence, then the sentence's DimT label was assigned as the default.

DimERR, the fourth dimension, measures a sentence's quality of expression with regard to grammar, mechanics, and word usage. More specifically, a sentence is considered low on this dimension if it contains frequent patterns of error, defined as follows: (a) it contains two or more errors in grammar, word usage, or mechanics (i.e., spelling, capitalization, or punctuation), (b) it is an incomplete sentence, or (c) it is a run-on sentence (i.e., four or more independent clauses within a sentence).
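The DimERR rule can be stated directly as code. The sketch below is a hypothetical rendering of the three conditions above; it assumes the error count, fragment flag, and clause count come from upstream grammar-analysis tools (such as those in Criterion), which are not shown.

```python
def dim_err_rank(error_count: int, is_fragment: bool, independent_clauses: int) -> str:
    """Rank a sentence on DimERR per the protocol's three conditions.
    The inputs are assumed to be produced by separate grammar/usage/
    mechanics detectors; this function only encodes the decision rule."""
    if error_count >= 2:            # (a) errors in grammar, usage, or mechanics
        return "low"
    if is_fragment:                 # (b) incomplete sentence
        return "low"
    if independent_clauses >= 4:    # (c) run-on sentence
        return "low"
    return "high"
```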
2.2 Topics, Human Annotation, and Human Agreement

Essays written to six topics in two genres were used: five of the topics were persuasive, and one was expository. Persuasive writing requires the writer to state an opinion on a particular topic, support the stated opinion, and convince the reader that the perspective is valid and well-supported. An expository topic requires the writer only to state an opinion on a topic; this typically elicits more personal and descriptive writing. Four of the five sets of persuasive essay responses were written by college freshmen, and the fifth by 12th graders. The set of expository responses was also written by 12th graders.

Two human judges participated in this study. The judges were instructed to assign the relevant dimension labels to each sentence. Pre-training of the judges was done using a set of approximately 50 essays across the six topics in the study. During this phase, the authors and the judges discussed and labeled the essays together. During the next training phase, the judges labeled a total of 292 essays across the six topics; they labeled the identical set of essays and were allowed to discuss their decisions. In the next phase, the annotation phase, the judges did not discuss their annotations. In this post-training phase, each judge labeled an average of about 278 unique essays across four prompts (556 essays between the two judges). Each judge also labeled an additional, overlapping set of 141 essays. So, about 20 percent of the data annotated by each judge in the annotation phase was overlapping and 80 percent was unique; the 20 percent is used to measure human agreement.[2] During both the training and annotation phases, Kappa statistics were computed on the judges' decisions regularly, and if the Kappa for any particular category fell below 0.8, the judges were asked to review the protocol until their agreement was acceptable. At the end of the annotation phase, we had a total of 989 labeled essays: 292 (training phase) + 278 x 2 (unique essays from annotators 1 and 2, annotation phase) + 141 (overlapping set, annotation phase).

Human Judge Agreement

It is critical that the annotation process yield agreement between the human judges that is high enough to suggest that people can agree on how to categorize the discourse elements. As stated in the section above, Kappa statistics were computed on a regular basis during the training of the judges for this study. Kappa between the judges for each category had to remain at or above 0.8, since this is believed to represent strong agreement (Krippendorff, 1980). In Table 1 we report human agreement on the overlapping data from the four topics on all four dimensions. Clearly, the level of human agreement is quite high across all four coherence dimensions. In addition, if we look at kappas of sentences based on discourse category, no kappa falls below 0.9.
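The agreement check itself uses the standard Cohen's kappa formula; a minimal implementation is sketched below (the textbook computation, not code from the paper).

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: chance-corrected agreement between two annotators
    who labeled the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)
```

For example, two judges' high/low labels on the overlapping 141 essays would be passed in as parallel lists, one kappa per dimension.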
3 Method

Our final system uses a hybrid approach to label three of the four coherence dimensions. For DimP and DimT, assigning coherence judgments to the sentences in an essay proceeds in three stages: 1) identifying the discourse label associated with each sentence in the essay, 2) computing features that quantify the semantic similarity between different discourse segments of the essay, and 3) applying a classifier to make a coherence judgment on a dimension. Consistent with the human-annotated data, a coherence judgment on any dimension is either "high" or "low." The method for DimERR is rule-based, and is discussed later.

3.1 Discourse element feature identification

As noted earlier, the two human judges in this study annotated the four coherence dimensions according to the human discourse label assignments. Accordingly, we also used the human-assigned discourse labels as features for predicting coherence judgments. In a deployed system, however, we would use discourse element labels generated by Criterion's discourse analysis system (Burstein et al., 2003b). Further evaluation is, of course, necessary in order to determine the effect of using these automatically assigned labels in place of the gold-standard discourse labels.

[2] For the annotation phase, we were unable to collect data for two essay prompts because of our annotators' availability. This means that we only have inter-annotator agreement statistics on four prompts, although some data from all six prompts was available for training and testing our models (with the extra two prompts being represented in the training phase of annotation).

3.2 Semantic similarity features

Given the partition of an essay into discourse segments, we then derive a set of features from the essay in order to predict how closely related each sentence is to various important text segments, such as the essay topic, and discourse elements, such as the thesis statement. As described in Section 4, the features that are most useful for classifying sentences according to coherence are semantic similarity features derived from Random Indexing (Kanerva et al., 2000; Sahlgren, 2001). Random Indexing is a vector-based semantic representation system similar to Latent Semantic Analysis. Our Random Indexing (RI) semantic space is trained on about 30 million words of newswire text. When we extract a feature such as "RI similarity to prompt" for a sentence, this essentially measures the extent to which the sentence contains terms from the same semantic domain as those found in the prompt. Within any discourse segment, any semantic information that is word-order dependent is lost.
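To illustrate how such similarity features can be computed, here is a minimal Random Indexing sketch. The dimensionality, sparsity, window size, and seed below are illustrative placeholders rather than the paper's actual settings, and the paper's semantic space was trained on about 30 million words of newswire rather than a toy corpus.

```python
import numpy as np

DIM, NONZERO = 1800, 8  # illustrative Random Indexing settings
rng = np.random.default_rng(0)

def index_vector() -> np.ndarray:
    """Sparse ternary index vector: a few randomly placed +1/-1 entries."""
    v = np.zeros(DIM)
    positions = rng.choice(DIM, size=NONZERO, replace=False)
    v[positions] = rng.choice([-1.0, 1.0], size=NONZERO)
    return v

def train_space(corpus: list[list[str]], window: int = 2) -> dict:
    """Accumulate each word's context vector from the index vectors of
    its neighbors within a sliding window; word order is not preserved."""
    index, context = {}, {}
    for sent in corpus:
        for w in sent:
            index.setdefault(w, index_vector())
            context.setdefault(w, np.zeros(DIM))
    for sent in corpus:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    context[w] += index[sent[j]]
    return context

def text_vector(words: list[str], context: dict) -> np.ndarray:
    """A text segment is the sum of its words' context vectors."""
    vecs = [context[w] for w in words if w in context]
    return np.sum(vecs, axis=0) if vecs else np.zeros(DIM)

def ri_similarity(a: list[str], b: list[str], context: dict) -> float:
    """Cosine similarity between two text segments in the RI space."""
    va, vb = text_vector(a, context), text_vector(b, context)
    denom = np.linalg.norm(va) * np.linalg.norm(vb)
    return float(va @ vb / denom) if denom else 0.0
```

A sentence's "RI similarity to prompt" feature is then ri_similarity(sentence_tokens, prompt_tokens, space), and the analogous call against the thesis tokens yields a DimT-style feature.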
3.3 Support vector classification

Finally, for each sentence in the essay, we use the features derived from the essay to determine whether it meets our criteria for coherence on these dimensions (DimP and DimT). To make this determination, we use a support vector machine (SVM) classifier (Vapnik, 1995; Cristianini and Shawe-Taylor, 2000). Specifically, we use an SVM with a radial basis function kernel, which exhibited good performance on a subset of about 30 essays from the pre-training data.
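A modern equivalent of this classification step can be sketched with scikit-learn's SVC, which implements an SVM with an RBF kernel. The feature values and labels below are fabricated toy inputs for illustration only; the paper's actual features are the Random Indexing similarities and discourse-structure information described above.

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-in features per sentence: [RI similarity to prompt,
# RI similarity to thesis]. Values and labels are fabricated.
X_train = np.array([[0.82, 0.75], [0.10, 0.22], [0.67, 0.80], [0.05, 0.15]])
y_train = np.array([1, 0, 1, 0])   # 1 = "high" coherence rank, 0 = "low"

clf = SVC(kernel="rbf")            # radial basis function kernel, as in the paper
clf.fit(X_train, y_train)

# Predict the coherence rank for a new sentence's feature vector.
print(clf.predict(np.array([[0.70, 0.60]])))   # expected: [1]
```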