<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1032"> <Title>Automated Scoring Using A Hybrid Feature Identification Technique</Title> <Section position="3" start_page="206" end_page="209" type="metho"> <SectionTitle> 2. Hybrid Feature Methodology </SectionTitle> <Paragraph position="0"> E-rater uses a hybrid feature methodology that incorporates several variables either derived statistically, or extracted through NLP techniques. The final linear regression model used for predicting scores includes syntactic, rhetorical and topical features. The next three sections present a conceptual rationale and a description of feature identification in essay responses.</Paragraph> <Section position="1" start_page="206" end_page="206" type="sub_section"> <SectionTitle> 2.1 Syntactic Features </SectionTitle> <Paragraph position="0"> The scoring guides indicate that one feature used to evaluate an essay is syntactic variety.</Paragraph> <Paragraph position="1"> All sentences in the essays were parsed using the Microsoft Natural Language Processing tool (MSNLP) (see MSNLP (1997)) so that syntactic structure information could be accessed. The identification of syntactic structures in essay responses yields information about the syntactic variety in an essay with regard to the identification of clause or verb types.</Paragraph> <Paragraph position="2"> A program was implemented to identify the number of complement clauses, subordinate clauses, infinitive clauses, relative clauses and occurrences of the subjunctive modal auxiliary verbs, would, could, should, might and may, for each sentence in an essay. Ratios of syntactic structure types per essay and per sentence were also used as measures of syntactic variety.</Paragraph> </Section> <Section position="2" start_page="206" end_page="207" type="sub_section"> <SectionTitle> 2.2 Rhetorical Structure Analysis </SectionTitle> <Paragraph position="0"> GMAT essay questions are of two types: Analysis of an Issue (issue) and Analysis of an Argument (argument). The GMAT issue essay asks the writer to respond to a general question and to provide &quot;reasons and/or examples&quot; to support his or her position on an issue introduced by the test question. The GMAT argument essay focuses the writer on the argument in a given piece of text, using the term argument in the sense of a rational presentation of points with the purpose of persuading the reader. The scoring guides indicate that an essay will receive a score based on the examinee's demonstration of a well-developed essay. In this study, we try to identify organization of an essay through automated analysis and identification of the rhetorical (or argument) structure of the essay.</Paragraph> <Paragraph position="1"> Argument structure in the rhetorical sense may or may not correspond to paragraph divisions.</Paragraph> <Paragraph position="2"> One can make a point in a phrase, a sentence, two or three sentences, a paragraph, and so on.</Paragraph> <Paragraph position="3"> For automated argument identification, e-rater identifies 'rhetorical' relations, such as Parallelism and Contrast that can appear at almost any level of discourse. This is part of the reason that human readers must also rely on cue words to identify new arguments in an essay.</Paragraph> <Paragraph position="4"> Literature in the field of discourse analysis supports our approach. 
<Paragraph position="4"> Literature in the field of discourse analysis supports our approach. It points out that rhetorical cue words and structures can be identified and used for computer-based discourse analysis (Cohen (1984), Mann and Thompson (1988), Hovy et al. (1992), Hirschberg and Litman (1993), Vander Linden and Martin (1995), and Knott (1996)). E-rater follows this approach by using rhetorical cue words and structure features, in addition to other topical and syntactic information. We adapted the conceptual framework of conjunctive relations from Quirk et al. (1985), in which cue terms such as &quot;In summary&quot; and &quot;In conclusion&quot; are classified as conjuncts used for summarizing. Cue words such as &quot;perhaps&quot; and &quot;possibly&quot; are considered to be &quot;belief&quot; words used by the writer to express a belief in developing an argument in the essay. Words like &quot;this&quot; and &quot;these&quot; may often be used to flag that the writer has not changed topics (Sidner (1986)). We also observed that in certain discourse contexts structures such as infinitive clauses mark the beginning of a new argument.</Paragraph>
<Paragraph position="5"> E-rater's automated argument partitioning and annotation program (APA) outputs an annotated version of each essay in which the argument units of the essay are labeled with regard to their status as &quot;marking the beginning of an argument&quot; or &quot;marking argument development.&quot; APA also outputs a version of the essay that has been partitioned &quot;by argument&quot; instead of &quot;by paragraph,&quot; as it was originally partitioned by the test-taker. To identify rhetorical structure, APA uses rules for argument annotation and partitioning based on the syntactic and paragraph-based distribution of cue words, phrases, and structures.</Paragraph>
<Paragraph position="6"> Relevant cue words and terms are stored in a cue word lexicon.</Paragraph>
</Section>
<Section position="3" start_page="207" end_page="207" type="sub_section">
<SectionTitle> 2.3 Topical Analysis </SectionTitle>
<Paragraph position="0"> Good essays are relevant to the assigned topic.</Paragraph>
<Paragraph position="1"> They also tend to use a more specialized and precise vocabulary in discussing the topic than poorer essays do. We should therefore expect a good essay to resemble other good essays in its choice of words and, conversely, a poor essay to resemble other poor ones. E-rater evaluates the lexical and topical content of an essay by comparing the words it contains to the words found in manually graded training examples for each of the six score categories.</Paragraph>
<Paragraph position="2"> Two programs were implemented that compute measures of content similarity, one based on word frequency (EssayContent) and the other on word weight (ArgContent), as in information retrieval applications (Salton (1988)).</Paragraph>
<Paragraph position="3"> In EssayContent, the vocabulary of each score category is converted to a single vector whose elements represent the total frequency of each word in the training essays for that category. In effect, this merges the essays for each score. (A stop list of some function words is removed prior to vector construction.) The system computes cosine correlations between the vector for a given test essay and the six vectors representing the trained categories; the category that is most similar to the test essay is assigned as the evaluation of its content. An advantage of using the cosine correlation is that it is not sensitive to essay length, which may vary considerably.</Paragraph>
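As a rough illustration of the EssayContent computation described above, the sketch below merges training essays per score category into word-frequency vectors and assigns a test essay the category with the highest cosine correlation. The tiny corpus, stop list, and tokenizer are invented for the example; they are not the actual training data or preprocessing.

```python
# Minimal sketch of an EssayContent-style comparison: merged word-frequency
# vectors per score category, cosine similarity against a test essay.
# The corpus, stop list, and tokenizer below are invented for illustration.
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "that"}

def tokenize(text):
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[w] * v[w] for w in set(u) & set(v))
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def build_category_vectors(training_essays):
    """Merge all training essays of each score category into one frequency vector."""
    return {score: Counter(w for essay in essays for w in tokenize(essay))
            for score, essays in training_essays.items()}

def essay_content_score(test_essay, category_vectors):
    """Assign the score category whose merged vector is most similar to the test essay."""
    test_vector = Counter(tokenize(test_essay))
    return max(category_vectors, key=lambda s: cosine(test_vector, category_vectors[s]))

if __name__ == "__main__":
    training = {
        6: ["The argument overlooks alternative explanations for the decline in sales."],
        3: ["Sales went down and the company should fix it."],
    }
    vectors = build_category_vectors(training)
    print(essay_content_score("The author ignores alternative explanations.", vectors))
```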
<Paragraph position="4"> The other content similarity measure is computed separately by ArgContent for each argument in the test essay and is based on the kind of term weighting used in information retrieval. For this purpose, the word frequency vectors for the six score categories, described above, are converted to vectors of word weights. The weight for word i in score category s is:</Paragraph>
<Paragraph position="5"> weight_{i,s} = (freq_{i,s} / max_freq_s) * log(n_essays_total / n_essays_i) </Paragraph>
<Paragraph position="6"> where freq_{i,s} is the frequency of word i in category s, max_freq_s is the frequency of the most frequent word in s (after a stop list of words has been removed), n_essays_total is the total number of training essays across all six categories, and n_essays_i is the number of training essays containing word i.</Paragraph>
<Paragraph position="7"> The first part of the weight formula represents the prominence of word i in the score category, and the second part is the log of the word's inverse document frequency. For each argument in the test essay, a vector of word weights is also constructed. Each argument is evaluated by computing cosine correlations between its weighted vector and those of the six score categories, and the most similar category is assigned to the argument. As a result of this analysis, e-rater has a set of scores (one per argument) for each test essay.</Paragraph>
<Paragraph position="8"> In a preliminary study, we looked at how well the minimum, maximum, mode, median, and mean of the set of argument scores agreed with the judgments of human raters for the essay as a whole. The greatest agreement was obtained from an adjusted mean of the argument scores that compensated for an effect of the number of arguments in the essay. For example, essays which contained only one or two arguments tended to receive slightly lower scores from the human raters than the mean of the argument scores, and essays which contained many arguments tended to receive slightly higher scores than the mean of the argument scores. To compensate for this, an adjusted mean of the argument scores is used.
3. Training and Testing
In all, e-rater's syntactic, rhetorical, and topical analyses yielded a total of 57 features for each essay. The training sets for each test question consisted of 5 essays for score 0, 15 essays for score 1, and 50 essays each for scores 2 through 6. To predict the score assigned by human raters, a stepwise linear regression analysis was used to compute the optimal weights for these predictors based on manually scored training essays. For example, Figure 1, below, shows the predictive feature set generated for the ARG1 test question (see results in Table 1). The predictive feature set for ARG1 illustrates how criteria specified for manual scoring described earlier, such as argument topic and development (using the ArgContent score and argument development terms), syntactic structure usage, and word usage (using the EssayContent score), are represented by e-rater. After training, e-rater analyzed new test essays, and the regression weights were used to combine the measures into a predicted score for each one. This prediction was then compared to the scores assigned by two human raters to check for exact or adjacent agreement.</Paragraph>
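Once a model is trained, producing a predicted score for a new essay amounts to a weighted combination of its feature values. The sketch below shows only that final step; the feature names, weights, and intercept are hypothetical placeholders, and the real system selects and weights its 57 predictors by stepwise linear regression as described above.

```python
# Minimal sketch of the final scoring step: combining an essay's feature
# values with regression weights learned during training. The feature names,
# weights, and intercept here are hypothetical placeholders.

WEIGHTS = {
    "arg_content_score": 0.45,
    "essay_content_score": 0.30,
    "n_argument_development_cues": 0.12,
    "n_subjunctive_modals": 0.08,
}
INTERCEPT = 0.6

def predict_score(features: dict) -> float:
    """Linear combination of feature values, clipped to the 0-6 score scale."""
    raw = INTERCEPT + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return max(0.0, min(6.0, raw))

if __name__ == "__main__":
    essay_features = {
        "arg_content_score": 5.0,
        "essay_content_score": 4.0,
        "n_argument_development_cues": 6,
        "n_subjunctive_modals": 3,
    }
    # Round the prediction so it can be compared with integer human scores.
    print(round(predict_score(essay_features)))
```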
[Figure 1: ARG1 Test Question -- predictive feature set]
</Section>
<Section position="4" start_page="207" end_page="209" type="sub_section">
<SectionTitle> 3.1 Results </SectionTitle>
<Paragraph position="0"> Table 1 shows the overall results for 8 GMAT argument questions, 5 GMAT issue questions, and 2 TWE questions. There was an average of 638 response essays per test question. E-rater and human rater mean agreement across the 15 data sets was 89%. In many cases, agreement was as high as that found between the two human raters.</Paragraph>
<Paragraph position="1"> The items that were tested represented a wide variety of topics (see http://www.gmat.org/ for GMAT sample questions and http://www.toefl.org/tstprpmt.html for sample TWE questions). The data also represented a wide variety of English writing competency. In fact, the majority of test-takers from the 2 TWE data sets were nonnative English speakers. Despite these differences in topic and writing skill, e-rater performed consistently well across items. To determine the features that were the most reliable predictors of essay score, we examined the regression models built during training. A feature type was considered to be a reliable predictor if it proved to be significant in at least 12 of the 15 regression analyses. Using this criterion, the most reliable predictors were the ArgContent and EssayContent scores, the number of cue words or phrases indicating the development of an argument, the number of syntactic verb and clause types, and the number of cue words or phrases indicating the beginning of an argument.</Paragraph>
</Section>
</Section>
</Paper>