<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0411">
  <Title>Automated Essay Scoring for Nonnative English Speakers</Title>
  <Section position="3" start_page="70" end_page="71" type="intro">
    <SectionTitle>
3. E-rater Agreement Performance on Nonnative Speaker Data
</SectionTitle>
      <Paragraph position="0"> Some questions that will now be addressed in looking at e-rater system performance on nonnative speaker essay data are: (1) How does performance for nonnative speakers on TWE compare with performance in operational sconng? (2) How does the system's agreement with human readers differ for each of the language groups in this 3 To date, this training sample composition has given us the best cross-validation results. Some previous studies experimenting with smaller training samples with this fairly flat distribution, or samples which reflect more directly the natural distribution of the data at each score point have shown lower performance in scoring cross-validation sets of 500 - 900 essays.</Paragraph>
      <Paragraph position="1">  study? (3) How does e-rater's agreement with human readers differ for the nonnative speaker language groups as compared to the English speaking language groups? (4) Is there a significant difference between the features used most often in models for operational prompts as compared to the TWE prompts?</Paragraph>
    </Section>
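    <!-- The questions above turn on how "agreement" between e-rater and human readers is
measured. As a minimal sketch, assuming the exact and exact-plus-adjacent (within one
point) agreement rates conventionally reported for 6-point holistic essay scoring, the
Python below illustrates the computation; the function name and sample scores are
illustrative assumptions, not from the paper.

from typing import Sequence, Tuple

def agreement_rates(machine: Sequence[int], human: Sequence[int]) -> Tuple[float, float]:
    """Exact and exact-plus-adjacent agreement between two score sequences."""
    assert len(machine) == len(human) > 0
    n = len(machine)
    exact = sum(m == h for m, h in zip(machine, human))              # identical scores
    adjacent = sum(abs(m - h) <= 1 for m, h in zip(machine, human))  # within one point
    return exact / n, adjacent / n

# Hypothetical scores for one language group:
print(agreement_rates([4, 3, 5, 2, 6, 4], [4, 4, 5, 3, 5, 4]))      # (0.5, 1.0)
    -->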
    <Section position="2" start_page="71" end_page="71" type="sub_section">
      <SectionTitle>
3.1 Data sample
</SectionTitle>
      <Paragraph position="0"> For this study, two prompts from the Test of Written English were used. These prompts (TWE1 and TWE2) ask candidates to read and think about a statement, and then to agree or disagree with the statement, and to give reasons to support the opinion given by the candidate. The scoring guides for these essays have a 6-point scale, where a &amp;quot;6&amp;quot; is the highest score and a &amp;quot;1&amp;quot; is the lowest score. They are holistic guides, though the criteria are more generally stated than in the scoring guides used to build e-rater.</Paragraph>
      <Paragraph position="1"> For each of the prompts a total of 255 essays were used for training. Fifty training essays were randomly selected from each of the score categories 2-6. Because of the small number of essays with a score of 1, only five l's were included in each training set. The remainder of the essays were used for cross-validation purposes.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>