<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1705">
  <Title>Indexing Student Essays Paragraphs using LSA over an Integrated Ontological Space</Title>
  <Section position="2" start_page="0" end_page="1" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> This paper describes a novel methodology that aims to support evaluators during the essay-marking process. The approach makes it possible to measure semantic similarity between structured information (i.e. the ontology and binary relations derived from the essay question) and unstructured information (i.e. text processed as a bag of words) by means of Latent Semantic Analysis (LSA) and the cosine similarity measure (Deerwester et al., 1990).</Paragraph>
    <Paragraph position="2"> Previous studies (Foltz et al., 1998; Wiemer-Hastings and Graesser, 1999) have used LSA to measure text coherence and comprehension by comparing units of text (i.e. sentences, terms or paragraphs) to determine how semantically related they are. The work presented in this paper is based on the use of &amp;quot;pseudo&amp;quot; documents: temporary documents containing a description of knowledge entities extracted from available domain ontologies (i.e. ontological relations). Both pseudo documents and paragraphs in student essays are represented as vectors, and essay paragraphs are indexed according to a measure of semantic similarity, the cosine similarity. The ontological space acts as a mediated schema: a set of virtual relations among knowledge entities related by their degree of similarity. When a new knowledge entity is added to this space, a similarity measure is automatically calculated against all the entities within the space.</Paragraph>
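The indexing step described above can be sketched as follows. This is a minimal illustration only: the toy term-to-document matrix, the rank k, and the assignment of columns to pseudo documents versus essay paragraphs are all invented for the example. Pseudo documents and essay paragraphs are column vectors of one matrix, LSA is applied as a truncated SVD, and each paragraph is indexed against the pseudo documents by cosine similarity in the reduced space.

```python
import numpy as np

# Toy term-to-document matrix: rows = terms, columns = documents.
# Columns 0-1 stand in for "pseudo documents" (descriptions of ontology
# entities); columns 2-3 stand in for student-essay paragraphs.
A = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 2.0],
])

# LSA: rank-k truncated SVD of the term-to-document matrix.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dimensional vector per document

def cosine(v, w):
    """Inner product of v and w divided by the product of their lengths."""
    return float(v @ w / (np.linalg.norm(v) * np.linalg.norm(w)))

# Index each essay paragraph against each pseudo document.
for p in (2, 3):
    sims = [cosine(doc_vectors[p], doc_vectors[d]) for d in (0, 1)]
    print(p, [round(x, 3) for x in sims])
```

In a real system the matrix would be built from pre-processed text and the paragraphs would be ranked by these similarity scores.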
    <Section position="1" start_page="0" end_page="1" type="sub_section">
      <SectionTitle>
1.1 Motivation and Context
</SectionTitle>
      <Paragraph position="0"> The main motivation for this work derives from a need for semantics in essay evaluation, whether by a tutor or by the student author in the process of writing. Page (Page, 1968) makes a useful distinction between marking for syntax (i.e. linguistic style) and marking for content (subject matter), which we will use in our outline. Based on this distinction, four main approaches to essay assessment have been reported (Williams, 2001).</Paragraph>
      <Paragraph position="2"> Early systems such as PEG (Page, 1966) relied mainly on syntactic and linguistic features and required a sample of the essays to be marked by a number of human judges. E-rater (Burstein et al., 1998) uses a combination of statistical and natural language processing techniques for the purpose of extracting linguistic features of the essays to be graded. Again, the essays are evaluated against a set of human-graded essays acting as a benchmark.</Paragraph>
      <Paragraph position="3"> In the LSA method of essay grading, an LSA space is constructed based on domain specific material and the student essays. LSA grading performance is about as reliable as human graders (Foltz, 1996).</Paragraph>
      <Paragraph position="4"> Text categorisation (Larkey, 1998) also requires a database of graded essays, so that new essays can be categorised in relation to them.</Paragraph>
      <Paragraph position="5"> In short, the approaches seen so far have either concentrated on syntactic and linguistic features or used domain knowledge in the form of keywords and documents about the domain. (Kukich presents a time line of research developments in the field of writing evaluation in her article Beyond Automated Essay Scoring (Kukich, 2000).) What we are proposing in this paper is that a further distinction should be made between using implicit (keywords, documents) and explicit content representations (see Fig. 1; our contribution is marked in bold).</Paragraph>
      <Paragraph position="6"> We then argue the case for adding explicit domain knowledge in the form of domain ontologies. In particular, we merge ontologies, LSA and first-order logic (FOL). An advantage of this approach is that it does not require a corpus of graded essays, except for validation. This feature enables tutors (or students in need of feedback) to evaluate essays on particular topics even when no pre-scored essay examples are available. Effectively, this capability may reduce the overall time required to prepare a reliable evaluation scheme for a new essay question.</Paragraph>
      <Paragraph position="7"> Figure 1 - Grading Criteria for Student Essays

2 LSA and the Cosine Similarity

In the vector space model, a term-to-document matrix is built in which the entries are weighted frequencies of pre-processed terms occurring in a collection of documents. Dimension reduction methods (such as LSA), when applied to the semantic vector space model, improve information retrieval, information filtering and word sense disambiguation. The reduction in dimensions reduces the noise in text categorisation, reduces the computational complexity of cluster creation, and produces the best statistical approximation to the original vector space model. Likelihood curves quantify the significance of the reduced model dimensions.</Paragraph>
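The term-to-document matrix described above can be sketched as follows. This is a minimal bag-of-words illustration using raw term frequencies; a real system would apply a weighting scheme to the entries, and the corpus and stop-word list here are invented for the example.

```python
from collections import Counter

# Illustrative corpus: each string stands in for one document.
corpus = [
    "the essay discusses latent semantic analysis",
    "semantic similarity between essay paragraphs",
    "cosine similarity over the vector space",
]
stop_words = {"the", "over", "between"}  # pre-processing: drop function words

# Tokenise and filter each document (bag of words).
docs = [[t for t in text.split() if t not in stop_words] for text in corpus]

# Rows = terms, columns = documents; entries = raw term frequencies.
vocab = sorted({t for d in docs for t in d})
matrix = [[Counter(d)[term] for d in docs] for term in vocab]

for term, row in zip(vocab, matrix):
    print(term, row)
```

A dimension reduction method such as LSA would then be applied to this matrix.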
      <Paragraph position="8"> Also, the significance of each dimension follows a Zipf distribution (Li, 1992), indicating that the reduced model dimensions represent latent concepts (Ding, 1999). Vectors in the reduced vector space model can be compared by measuring the semantic similarity between them by means of the cosine similarity. The cosine of the angle between two vectors v and w is defined as the inner product of the vectors divided by the product of their lengths.</Paragraph>
      <Paragraph position="10"> cos(v, w) = (v · w) / (||v|| ||w||)</Paragraph>
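The definition above can be worked through numerically. The two vectors below are invented for the example; the computation follows the definition exactly: inner product of v and w divided by the product of their lengths.

```python
import math

# Illustrative vectors.
v = [1.0, 2.0, 0.0]
w = [2.0, 1.0, 1.0]

inner = sum(a * b for a, b in zip(v, w))            # v · w = 4.0
length_v = math.sqrt(sum(a * a for a in v))          # ||v|| = sqrt(5)
length_w = math.sqrt(sum(b * b for b in w))          # ||w|| = sqrt(6)
cos_vw = inner / (length_v * length_w)

print(round(cos_vw, 4))  # prints 0.7303
```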
    </Section>
  </Section>
</Paper>