<?xml version="1.0" standalone="yes"?>
<Paper uid="A97-1026">
  <Title>An Automatic Scoring System For Advanced Placement Biology Essays</Title>
  <Section position="3" start_page="174" end_page="174" type="metho">
    <SectionTitle>
Footnote 2
</SectionTitle>
    <Paragraph position="0"> The Poor classification is not an official AP classification. It was used in this study to distinguish the Excellent essays, with scores of 9 and 10, from essays with lower-end scores in the 0-3 range.</Paragraph>
    <Paragraph position="1"> assignment are highly constrained. Essentially, the  essay can be treated as a sequence of short-answer responses. Given our preliminary successes with test questions that elicit multiple responses from examinees, similar scoring methods were applied for scoring AP Biology essay. The results show 87% agreement for exact scores between human rater and computer scores, and 94% agreement for exact or adjacent scores between human rater and computer scores.</Paragraph>
    <Paragraph position="2"> This work is also applicable for other types of assessment as well, such as for employee training courses in corporate and government settings.</Paragraph>
    <Paragraph position="3"> Since the methods discussed in this paper describe techniques for analysis of semantic information in text, presumably this application could be extended to public informational settings, in which people might key in &amp;quot;requests for information&amp;quot; in a number of domains. In particular, these methods could be successfully applied to the analysis of natural language responses for highly constrained domains, such as exist in scientific or technical fields.</Paragraph>
  </Section>
  <Section position="4" start_page="174" end_page="175" type="metho">
    <SectionTitle>
SYSTEM TRAINING
</SectionTitle>
    <Paragraph position="0"> One hundred Excellent essays from the original 200 essays were selected to train the scoring system. The original 200 essays were divided into a training set and test set, selected arbitrarily from the lowest examinee identification number. Only 85 of the original 100 in the test set were included in the study due to illegibility, or use of diagrams instead of text to respond to the question. For convenience during training, and later, for scoring, essays were divided up by section, as specified in the scoring guide (see Figure 1), and stored in directories by essay section. Specifically, the Part A's of the essays were stored in a separate directory, as were Part B's, and Part C's.</Paragraph>
    <Paragraph position="1"> Examinees typically partitioned the essay into sections that corresponded to the scoring guide.</Paragraph>
    <Paragraph position="2"> System training involved the following steps that are discussed in subsequent sections: a) manual lexicon development, b) automatic generation of concept-structure representation (CSR), c) manual creation of a computer-based rubric, d) manual CSR &amp;quot;fine-tuning&amp;quot;, e) automatic rule generation, and f) evaluation of training process.</Paragraph>
    <Section position="1" start_page="174" end_page="175" type="sub_section">
      <SectionTitle>
Lexicon Development
</SectionTitle>
      <Paragraph position="0"> Example-based approaches to lexicon development have been shown to effectively exemplify word meaning within a domain (Richardson, et al., 1993, and Tsutsumi 1992). It has been further pointed out by Wilks, et al, 1992, that word senses can be effectively captured on the basis of textual material, The lexic, on dwC/lopcd for this study used an example-based approach to compile a list of lexical items that characterized the content vocabulary used in the domain of the test question (i.e., gel electrophoresis). The lexicon is composed of words and terms from the relevant vocabulary of the essays used for training.</Paragraph>
      <Paragraph position="1"> To build the lexicon, all words and terms considered to contribute to the core meaning of each relevant sentence in an essay, were included in the lexicon. The decision with regard to whether or not a sentence was relevant was based on information provided in the scoring guide (in Figure 1). For instance, in the sentence, &amp;quot;Smaller DNA fragments mave faster than larger ones.&amp;quot;, the terms Smaller, DNA, fragments, move, faster, larger are considered to be the most meaningful terms in the sentence. This is based on the criteria for a correct response for the Rate/Size category in the scoring guide.</Paragraph>
      <Paragraph position="2"> Each lexical entry contained a superordinate concept and an associated list of metonyms.</Paragraph>
      <Paragraph position="3"> Metonyms are words or terms which are acceptable substitutions for a given word or term (Gerstl, 1991). Metonyms for concepts in the domain of this test question were selected from the example responses in the training data This paradigm was used to identify word similarity in the domain of the essays. For instance, the scoring program needed to recognize that sentences, such as Smaller DNA fragments move faster than larger ones and The smaller segments of DNA will travel more quickly than the bi~.~er ones, contain alternate words with similar meanings in the test question domain. To determine alternate words with similar meanings, metonyms for words, such as fragments and move were established in the  lexicon so that the system could identify which words had similar meanings in the test item domain. The example lexical entries in (1) illustrate that the words fragment and segment are metonyms in this domain, as well as the words move and travel. In (1), FRAGMENT and MOVE are the higher level lexical concepts. The associated metonymsfor FRAGMENT and MOVE are in adjacent lists illustrated in (1).</Paragraph>
      <Paragraph position="4"> (1). Sample Lexical Entries wouM be digested only once, leaving 2 pieces.&amp;quot;, and &amp;quot;The DNA fragment wouM only have 2 segments,&amp;quot; the phrases DATA segment and DNA fragment are paraphrases of each other, and 2 pieces and 2 segments are paraphrases of each other. These sentences are represented by the CSR in (2a) and in (2b).</Paragraph>
      <Paragraph position="5"> (2)a. NP: \[DNA,FRAGMENT\] NP: \[TWO,FRAGMENT\] FRAGMENT \[fragment particle segment...\] MOVE \[ move travel pass pull repel attract ...\] In the final version of the CSR, phrasal constituents are reduced to a general XP node, as is illustrated in Concept-Structure Representations (CSR) Obviously, no two essays will be identical, and it is unlikely that two sentences in two different essays will be worded exactly alike. Therefore, scoring systems must be able to recognize paraphrased information in sentences across essay responses.. To identify paraphrased information in sentences, the scoring system must be able to identify similar words in consistent syntactic patterns. As, Montemagni and Vanderwende (1993) have also pointed out, structural patterns are more desirable than string patterns for capturing semantic information from text. We have implemented a concept-extraction program for preprocessing of essay data that outputs conceptual information as it exists in the structure of a sentence. The program reads in a parse tree generated by MicrosoR's Natural Language Processing Tools (MSNLP) for each sentence in an essay) The program substitutes words in the parse tree with superordinate concepts from the lexicon, and extracts the phrasal nodes containing these concepts. (Words in the phrasal node which do not match a lexical concept are not included in the set of extracted phrasal nodes.) The resulting structures are CSRs. Each CSR represents a sentence according to conceptual content and phra~l constituent structure. CSRs characterize paraphrased information in sentences. For example, in the sentences &amp;quot;The DNA segment (2)b..XP: \[DNA,FRAGMENT\]</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="175" end_page="177" type="metho">
    <SectionTitle>
XP: [TWO,FRAGMENT]
</SectionTitle>
    <Paragraph position="0"> Since phrasal category does not have to be specified, the use of a generalized XP node minimizes the number of required lexical entries, as well as the number of concept grammar rules needed for the scoring process.</Paragraph>
    <Paragraph position="1"> The Computer Rubric Recall that a rubric is a scoring key. Rubric categories are the criteria that determine a correct response. A computer-based rubric was manually created for the purpose of classifying sentences in essays by rubric category during the automated scoring process. Computer rubric categories are created for the bulleted categories listed in the human rater scoring guide illustrated in Figure 1.  Accordingly, the computer-rubric categories were the following. For Part A, the categories were Electricity, Charge, Rate~size, Calibration, Resolution, and Apparatus. For Part B the categories were, Treatment I, Treatment 2, Treatment 3, and Treatment IV. For Part C1, the categories were: Recognition, Cutting, Alternate, and Detail Point. For Part C2, the categories were Change in l, Change in II, Alternate, and Detail Point. Each computer-rubric category exists as an electronic file and contains the related concept grammar rules used during the scoring process.</Paragraph>
    <Paragraph position="2"> The concept grammar rules are described later in the paper.</Paragraph>
    <Paragraph position="3"> Fine-Tuning CSRs CSRs were generated for all sentences in an essay. During training, the CSRs of relevant sentences from the training set were placed into computer-rubric category files. Relevant sentences in essays were sentences identified in the scoring guide as containing information relevant to a rubric category. For example, the representation for the sentence, &amp;quot;The DNA fragment would only have 2 segments,&amp;quot; was placed in the computer rubric category file for Treatment II.</Paragraph>
    <Paragraph position="4"> Typically, CSRs are generated with extraneous concepts that do not contribute to the core meaning of the response. For the purpose of concept grammar rule generation, each CSR from the training data must contain only concepts which denote the core meaning of the sentence.</Paragraph>
    <Paragraph position="5"> Extraneous concepts had to be removed before the rule generation process, so that the concept-structure information in the concept grammar rules would be precise.</Paragraph>
    <Paragraph position="6"> The process of removing extraneous concepts from the CSRs is currently done manually. For this study, all concepts in the CSR that were considered to be extraneous to the core meaning of the sentence were removed by hand. For example, in the sentence, The DNA segment would be digested only once, leaving 2 pieces, the CSR in (3) was generated. For Treatment \]I, the scoring guide indicates that if the sentence makes a reference to 2 fragments that it should receive one point. (The word, piece, is a metonym for the concept, fragment, so these two words may be used interchangably.) The CSR in (3) was generated by the concept-extraction program. The CSR in (4) (in which XP:\[DNA,FRAGMENT\] was removed) illustrates the fine-tuned version of the CSR in (3). The CSR in (4) was then used for the rule generation process, described in the next section.  At this point in the process, each computer rubric categow is an electronic file which contains finetuned, CSRs. The CSRs in the computer rubric categories exemplify the information required to receive credit for a sentence in a response. We have developed a program that automatically generates rules from CSP.s by generating permutations of each CSR The example rules in (5) were generated from the CSR in (4). The rules in (5) were used during automated scoring (described in the following section).</Paragraph>
    <Paragraph position="7"> which looks for matches between CSRs and/or subsets of CSRs, and concept grammar rules in rubric categories associated with each essay part. Recall that CSRs often have extraneous concepts that do not contribute to the core meaning of the sentence. Therefore, the scoring program looks for matches between concept grammar rules and subsets of CSRs, if no direct match can be found for the complete set of concepts in a CSR. The scoring program assigns points to an essay as rule matches are found, according to the scoring guide (see Figure 1). A total number of points is assigned to the essay after the program has looked at all sentences in an essay. Essays receiving a total of at least 9 points are classified as Excellent, essays with 3 points or less are classified as Poor, and essays with 4 - 8 points are classified as &amp;quot;Not Excellent.&amp;quot; The example output in Appendix 1 illustrates matches found between sentences in the essay and the rubric rules from an Excellent essay. (5)a. XP:\[TWO, FRAGMENT\] b. XP:\[FRAGMENT,TWO\] The trade-off for generating rules automatically in this manner is rule overgeneration, but this does not appear to be problematic for the automated scoring process. Automated rule generation is significantly faster and more accurate than writing the rules by hand. We estimate that it would have taken two people about two weeks of full-time work to manually create the rules. Inevitably, there would have been typographical errors and other kinds of &amp;quot;human error&amp;quot;. It takes approximately 3 minutes to automatically generate the rules.</Paragraph>
  </Section>
  <Section position="6" start_page="177" end_page="177" type="metho">
    <SectionTitle>
AUTOMATED SCORING
</SectionTitle>
    <Paragraph position="0"> The 85 remaining Excellent test essays and a set of 20 Poor essays used in this study were scored.</Paragraph>
    <Paragraph position="1"> First, all sentences in Parts A, B and C of each essay were parsed using MSNLP. Next, inflectional suffixes were automatically removed from the words in the parsed sentences, since inflectional suffixed forms are not included in the lexicon. CSRs were automatically generated for all sentences in each essay. For each part of the essay, the scoring program uses a searching algorithm</Paragraph>
  </Section>
class="xml-element"></Paper>