<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-0303">
  <Title>Enriching Automated Essay Scoring Using Discourse Marking</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
* Abstract
</SectionTitle>
    <Paragraph position="0"> Electronic Essay Rater (e-rater) is a prototype automated essay scoring system built at Educational Testing Service (ETS) that uses discourse marking, in addition to syntactic information and topical content vector analyses to automatically assign essay scores. This paper gives a general description ore-rater as a whole, but its emphasis is on the importance of discourse marking and argument partitioning for annotating the argument structure of an essay.</Paragraph>
    <Paragraph position="1"> We show comparisons between two content vector analysis programs used to predict scores. EsscQ/'Content and ArgContent. EsscnContent assigns scores to essays by using a standard cosine correlation that treats the essay like a &amp;quot;'bag of words.&amp;quot; in that it does not consider word order. Ark, Content employs a novel content vector analysis approach for score assignment based on the individual arguments in an essay. The average agreement between ArgContent scores and human rater scores is 82%. as compared to 69% agreement between EssavContent and the human raters. These results suggest that discourse marking enriches e-rater's scoring capability. When e-rater uses its whole set of predictive features, agreement with human rater scores ranges from 87deg,/o - 94% across the 15 sets of essa5 responses used in this study</Paragraph>
  </Section>
  <Section position="2" start_page="0" end_page="15" type="metho">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> The development of Electronic Essay Rater (e-rater).</Paragraph>
    <Paragraph position="1"> an automated prototype essay scoring system, was motivated by practical concerns of time and costs that limit the number of essay questions on current standardized tests. Literature on automated essay scoring shows that reasonably high agreement can be achieved between a machine score and a human rater score simply by doing analyses based on the number of words in an essay (Page and Peterson (1995)).</Paragraph>
    <Paragraph position="2"> Scoring an essay based on the essay length is not a criterion that can be used to define competent writing. In addition, from a practical standpoint.</Paragraph>
    <Paragraph position="3"> essay length is a highly coachable feature. It doesn't take examinees long to figure out that a computer will assign a high score on an essay based on a pre-specified number of words.</Paragraph>
    <Paragraph position="4"> E-rater's modules extract syntactic and discourse structure information from essays, as well as information about vocabulary content in order to predict the score. The 57 features included in e-rater  are based on writing characteristics specified at each of the six score points in the scoring guide used by human raters for manual scoring (also available at http://www.gmat.org;). For example, the scoring guide indicates that an essay that stays on the topic of the test question, has a strong, coherent and well-organized argument structure, and displays a variety of word use and syntactic structure will receive a score at the higher end of the six-point scale (5 or 6}. Lower scores are assigned to essays as these characteristics diminish.</Paragraph>
    <Paragraph position="5"> Included in e-rater's feature set are features derived from discourse structure, syntactic structure, and topical analysis as they relate to the human scoring guide. For each essay question, e-rater is run on a set of training data (human-scored essay responses) to extract t~.atures. A stepwise linear regression analysis is performed on the features extracted from the training set to determine which ones have significant weights (the predictive features). Final score prediction for cross-validation sets is performed using these predictive features identified in the training sets. Accuracy is determined by measuring agreement between human rater assigned scores and machine predicted scores, which are considered to &amp;quot;agree&amp;quot; if there is no greater than a single point difference on the six-point scale. This is the same criterion used to measure agreement between two human raters.</Paragraph>
    <Paragraph position="6"> Among the strongest predictive features across the essay questions used in this study are the scores generated from ArgContent (a content vector analysis applied to discourse chunked text), and discourse-related surface cue word and non-lexical features. On average, ArgContent alone has 82% agreement with the human rater score as compared to EssavContent's 69%. EssayContent is a content vector analysis program that treats an essay like a &amp;quot;'bag of words.&amp;quot; This suggests two things. First, the discourse markers detected by the argument annotation and partitioning program. APA. are helpful for identification of relevant units of discourse in essay responses. Second. the application of content vector analysis to those text units appears to increase scoring performance. Overall, it appears that discourse marking provides feature information that is useful in e-rater's essay score predictions.</Paragraph>
    <Paragraph position="7"> A long-term goal of automated essay scoring is to be able to generate diagnostic or instructional information, along with a numeric score to a test-taker or instructor. Information about the discourse structure of essays brings us closer to being able to generate informative feedback to test-takers about the essay's cohesion.</Paragraph>
    <Paragraph position="8"> We report on the overall evaluation results from crater's scoring performance on 13 sets of essay data from the Analytical Writing Assessments of the Graduate Management Admissions Test (GMAT) (see http://www.gmat.org/) and 2 sets of essay data from the Test of Written English (TWE) (see http://w.w.w.toefl.or~tstprpmt.html for sample TWE questions). The paper devotes special attention to e-rater's discourse marking and analysis components.</Paragraph>
  </Section>
  <Section position="3" start_page="15" end_page="638" type="metho">
    <SectionTitle>
2. Hybrid Feature Methodology
</SectionTitle>
    <Paragraph position="0"> E-rater uses a hybrid feature approach in that it incorporates several variables that are derived statistically, or extracted through NLP techniques.</Paragraph>
    <Paragraph position="1"> The following sections describe the features used in this study.</Paragraph>
    <Section position="1" start_page="15" end_page="15" type="sub_section">
      <SectionTitle>
2.1 Syntactic Features
</SectionTitle>
      <Paragraph position="0"> The scoring guides indicate that one feature used to evaluate an essay is syntactic variety. Syntactic structures in essays are identified using NLP techniques. All sentences are parsed with the Microsoft Natural Language Processing tool (MSNLP) (see MSNLP(1997)). Examination of the parse trees yields information about syntactic variety with regard to what kinds of clauses or verb types were used by a test-taker.</Paragraph>
      <Paragraph position="1"> A program was implemented to identify the number of complement clauses, subordinate clauses, infinitive clauses, relative clauses and occurrences of tile subjunctive modal auxiliary, verbs, would, could, .~'hould. might and may, tbr each sentence in an essay. Ratios of syntactic structure types per essay and per sentence were calculated as possible measures of syntactic variety.</Paragraph>
    </Section>
    <Section position="2" start_page="15" end_page="16" type="sub_section">
      <SectionTitle>
2.2 Discourse Structure Analysis
</SectionTitle>
      <Paragraph position="0"> GMAT essay questions are of two types: Analysis of an Issue (issue) and Analysis of an Argument (argument). The issue essay asks the writer to respond to a general question and to provide &amp;quot;reasons and'or examples&amp;quot; to support his or her position on an issue introduced by the test question. The argument essay tbcuses the writer on the argument in a given piece of text. using the term argument in tile sense of a rational presentation of points with the purpose of persuading the reader. The scoring guides used for manual scoring indicate that an essay will receive a score based on the examinee's demonstration of a well-developed essay. For the argument essay', for instance, tile scoring guide states that a &amp;quot;'6&amp;quot;&amp;quot; essay &amp;quot;'develops ideas cogently, organizes them logically, and connects them with clear transitions.&amp;quot; The correlate to this for the issue essay would appear to be that a &amp;quot;'6&amp;quot;&amp;quot; essay &amp;quot;'...develops a position on the issue with insightful reasons...&amp;quot; and that the essay &amp;quot;'is clearly well-organized.'&amp;quot; Nolan (I 997) points out that terms in holistic scoring guides, such as &amp;quot;'cogent.&amp;quot; &amp;quot;'logical.&amp;quot; &amp;quot;'insightful.&amp;quot; and &amp;quot;'well-organized&amp;quot; have &amp;quot;'fuzzy&amp;quot; meaning, since they are based on imprecise observation. Nolan uses methods of&amp;quot;fuzzy, logic&amp;quot; to automatically assign these kinds of &amp;quot;fuzzy&amp;quot; classifications to essays. In this study, we try, to identify organization of an essay through automated analysis and identification of the essay's argument structure through discourse marking.</Paragraph>
      <Paragraph position="1">  Since there is no particular text unit that reliably Corresponds to the stages, steps, or passages of an argument, readers of an essay must rely on other things such as surface cue words to identify individual arguments. We found that it was useful to identify rhetorical relations such as Parallelism and Contrast. and content or coherence relations that have more to do with the discourse involved. These relations can appear at almost any level -- phrase, sentence, a chunk consisting of several sentences, or paragraph. Therefore, we developed a program to automatically identify the discourse unit of text using surface cue words and non-lexical cues.</Paragraph>
      <Paragraph position="3"> such us arts. music or social sciences will be most Uffected by this drop in high school population.</Paragraph>
      <Paragraph position="5"> As literature in the field of discourse analysis points out. surface cue words and structures can be identified and used for computer-based discourse analysis (Cohen (1984), (Mann and Thompson (1988), Hovy. et al (1992) Hirschberg and Litman (1993), Vander Linden and Martin (1995), Knott (1996) and Litman (1996)). E-rater's AP.4 module uses surface cue words and non-lexical cues (i.e., syntactic structures) to denote discourse structure in essays. We adapted the conceptual framework of conjunctive relations from Quirk. et al (1985) in ~hich terms, such as &amp;quot;'In summary&amp;quot; and &amp;quot;'In conclusion,&amp;quot; which we consider to be surface cue terms, are classified as conjuncts used tot summarizing. Cue words such as &amp;quot;'perhaps&amp;quot; and &amp;quot;'possibly&amp;quot; are considered to be Belief words used by the writer to express a belief with regard to argument development in essays. Words like &amp;quot;'this&amp;quot; and &amp;quot;'these&amp;quot; may often be used to flag that the writer is developing on the same topic (Sidner (1986)). We also observed that. in certain discourse contexts, nonlexicat, syntactic structure cues, such as infinitive or complement clauses, may characterize the beginning of a new argument.</Paragraph>
      <Paragraph position="6"> The automated argument partitioning and annotation program (APA) was implemented to output a discourse-marked annotated version of each essay in which the discourse marking is used to indicate new arguments (arg_init), or development of an argument (arg_dev). An example of APA annotations is shown in Figure I.</Paragraph>
    </Section>
    <Section position="3" start_page="16" end_page="16" type="sub_section">
      <SectionTitle>
New Paragraph:
</SectionTitle>
      <Paragraph position="0"> ...</Paragraph>
      <Paragraph position="1"> Sentence I: fl is also assumed that shrinking high school enrollment ram, lead to a shortage of qual(fied engineers.</Paragraph>
      <Paragraph position="2"> AP.4&amp;quot;s heuristic rules for discourse marker annotation and argument partitioning are based on syntactic and paragraph-based distribution of surface cue words. phrases and non-lexical cues corresponding to discourse structure. Relevant cue words and terms are contained in a specialized surface cue word and phrase lexicon. In Figure 1, the annotations.</Paragraph>
      <Paragraph position="3"> arg_init#PARALLEL, and arg_dev#DETAIL indicate the rhetorical relations of Parallel structure and Detail intbrmation, respectively, in arguments. The arg_dev~-SAME_TOPIC label denotes the pronoun &amp;quot;'it&amp;quot; as indicating the writer has not changed topics. Tile labels arg_in it-C LA I M_THAT and arg_dev=CLAIM_THAT indicate that a complement clause was used to flag a new argument, or argument development. Arg_aux=SPECULATE flags subjunctive modals that are believed to indicate a writer's speculation. Preliminary analysis of these rules indicates that some rule refinements might be useful; however, more research needs to be done on this. ~ Based on the arg_init flags in the annotated essays, .4P.q outputs a version of the essay partitioned &amp;quot;by argument&amp;quot;. The argument-partitioned versions of essays are input to .4rgContent. tile discourse-driven, topical analysis program described below.</Paragraph>
    </Section>
    <Section position="4" start_page="16" end_page="18" type="sub_section">
      <SectionTitle>
2.3 Topical Analysis
</SectionTitle>
      <Paragraph position="0"> Good essays are relevant to the assigned topic. The}' also tend to use a more specialized and precise vocabulary in discussing the topic than poorer essays do. We should therefore expect a good essay to resemble other good essays in its choice of words and. conversely, a poor essay to resemble other poor ones. E-rater evaluates tile topical content of an ' We thank Mary Dee Harris for her analysis of APA annotated outputs.</Paragraph>
      <Paragraph position="1">  essay bx comparing the words it contains to the words tbund in manually, graded training examples tbr each of the six score categories. Two measures of content similarity are computed, one based on word frequency and the other on word weight, as in information retrieval applications (Salton. 1988). For the former application (EssayContent). content similarit2, is computed over the essay as a ~hole. while in the latter application (ArgComem) content similarities are computed for each argument in an essay.</Paragraph>
      <Paragraph position="2"> For the frequency based measure (the EssavComent program), the content of each score category is converted to a single vector whose elements represent the total frequency of each word in the training essays for that category. In effect, this merges the essays for each score. (A stop list of some function words is removed prior to vector construction.) The system computes cosine correlations between the vector for a given test essay and the six vectors representing the trained categories: the category that is most similar to the test essay is assigned as the evaluation of its content. An advantage of using the cosine correlation is that it is not sensitive to essay length, which ma3 vary' considerably.</Paragraph>
      <Paragraph position="3"> The other content similarity measure..4rgContem, is computed separatel3, for each argument in the test essay and is based on the kind of term weighting used in information retrieval. For this purpose, the word frequency vectors for the six score categories. described above, are converted to vectors of word weights. The ~eight for word i in score category s is:</Paragraph>
      <Paragraph position="5"> where freq,~ is the frequency of word i in category s.</Paragraph>
      <Paragraph position="6"> max_freq~ is the frequency of the most frequent word in x (after a stop list of words has been removed).</Paragraph>
      <Paragraph position="7"> n_essays,o,., is the total number of training essays across all six categories, and n_essays, is the number of training essays containing word i.</Paragraph>
      <Paragraph position="8"> The first part of the weight formula represents the prominence of word i in the score category, and the second part is the log of the word's inverse document frequency (IDF). For each argument a in the test essay, a vector of word weights is also constructed.</Paragraph>
      <Paragraph position="9"> The weight for ~ord i in argument a is</Paragraph>
      <Paragraph position="11"> where freq, ~ is the frequency of word i in argument a.</Paragraph>
      <Paragraph position="12"> and max_freq, is the frequency of the most frequent word in a (once again, after a stop list of words has been removed). Each argument (as it has been partitioned by APA) is evaluated by computing cosine correlations between its weighted vector and those of the six score categories, and the most similar category is assigned to the argument. As a result of this analysis, e-rater has a set of scores (one per argument) for each test essay.</Paragraph>
      <Paragraph position="13"> We were curious to find out if an essay containing several good arguments (each with scores of 5 or 6) and several poor arguments (each with scores of 1 or 2) produced a different overall judgment by the human raters than an essay consisting of uniformly mediocre arguments (3&amp;quot;s or 4&amp;quot;s). or if perhaps humans were most influenced by the best or poorest argument in the essay. In a preliminary study, we looked at how well the minimum, maximum, mode.</Paragraph>
      <Paragraph position="14"> median, and mean of the set of argument scores agreed with the judgments of human raters for the essay as a whole. The mode and the mean showed good agreement with human raters, but the greatest agreernent was obtained from an adjusted mean of the argument scores which compensated for an effect of the number of arguments in the essay. For example, essays which contained only one or two arguments tended to receive slightly lower scores from the human raters than the mean of the argument scores, and essays which contained many arguments tended to receive slightly higher scores than the mean of tile argument scores. To compensate for this, an adjusted mean is used as e-rater's ArgContent.</Paragraph>
      <Paragraph position="16"> 3. Training and Testing  In all. e-rater's syntactic, discourse, and topical analyses yielded a total of 57 features for each essay. The majority of the features in the overall feature set are discourse-related (see Table 3 for some examples). To predict the score assigned by human raters, a stepwise linear regression analysis was used to compute the optimal weights for these predictors based on manually scored training essays. The training sets for each test question consisted of a total  of 270 essays. 5 essays for score 0:, 15 essays for score I (a rating infrequently used by the human raters) and 50 essays each for scores 2 through 6. After training, e-rater analyzed new test essays, and the regression weights were used to combine the measures into a predicted score for each one. E-toter predictions were compared to the two human rater scores to measure exact and adjacent agreement (see Table 1). Figure 2 shows the predictive feature set identified by the regression analysis for one of the example test questions. ARG I, in Tables ! and 2.</Paragraph>
    </Section>
    <Section position="5" start_page="18" end_page="638" type="sub_section">
      <SectionTitle>
3.1 Results
</SectionTitle>
      <Paragraph position="0"> Table I shows the overall results for 8 GMAT argument questions, 5 GMAT issue questions and 2 TWE questions. The level of agreement between e-rater and the human raters ranged from 87% to 94% across the 15 tests. Agreement appears to be comparable to that found between the human raters.</Paragraph>
      <Paragraph position="1">  Table 2 shows that scores generated by ArgContent have higher agreement with human raters than do scores generated by Essaa'Content. This suggests that the discourse structures generated by APA are useful for score prediction, and that the application of content vector analysis to text partitioned into smaller units of discourse might improxe e-rater's overall storing accuracy.</Paragraph>
      <Paragraph position="2">  Results tbr the essay questions in Tables I and 2 represent a wide variety of topics. (Sample questions that show topical variety in GMAT essays can be viewed at http://www.gmat.org/. Topical variety in TWE questions can be reviewed at http://www.toefl.org/tstprpmt.html.) The data also represented a wide range of English writing competency. The majority of test-takers from the two TWE data sets were nonnative English speakers. Despite these differences in topic and writing skill, erater, as well as EssayContenr and ArgContent performed consistently across items. In fact. over the 15 essay questions, the discourse t~atures output by APA and scores output by ArgContent (based on discourse-chunked text) account for the majority of the most frequently occurring predictive features. These are shown in Table 3.</Paragraph>
      <Paragraph position="3"> We believe that the discourse related features used by e-rater might be the most useful building blocks for automated generation of diagnostic and instructional sumnaaries about essays. For example, sentences indicated as &amp;quot;'the beginning of an argument&amp;quot; could be used to flag main points of an essay (Marcu (1997)). ArgContent's ability to generate &amp;quot;'scores&amp;quot; for each argument could provide information about the relevance of individual arguments in an essay, which in turn could be used to generate helpful diagnostic or instructional information.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>