File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/05/w05-0206_metho.xml
Size: 24,586 bytes
Last Modified: 2025-10-06 14:09:52
<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0206"> <Title>Automatic Essay Grading with Probabilistic Latent Semantic Analysis</Title> <Section position="3" start_page="29" end_page="33" type="metho"> <SectionTitle> 2 AEA System </SectionTitle> <Paragraph position="0"> We have developed a system for automated assessment of essays (Kakkonen et al., 2004; Kakkonen and Sutinen, 2004). In this section, we explain the basic architecture of the system and describe the methods used to analyze essays.</Paragraph> <Section position="1" start_page="29" end_page="30" type="sub_section"> <SectionTitle> 2.1 Architecture of AEA </SectionTitle> <Paragraph position="0"> Two approaches are commonly used in essay grading systems to determine the grade for an essay: 1. The essay to be graded is compared to human-graded essays, and the grade is based on the grades of the most similar essays; or 2. The essay to be graded is compared to materials related to the essay topic (e.g. a textbook or model essays), and the grade is given based on the similarity to these materials.</Paragraph> <Paragraph position="1"> In our system, AEA (Kakkonen and Sutinen, 2004), we have combined these two approaches. The relevant parts of the learning materials, such as chapters of a textbook, are used to train the system with assignment-specific knowledge. Approaches based on comparing the essays to be graded with the textbook have been introduced in (Landauer et al., 1997; Foltz et al., 1999a; Lemaire and Dessus, 2001; Hearst et al., 2000), but they have usually been found less accurate than methods based on comparison to prescored essays. Our method attempts to overcome this by combining the use of course content and prescored essays. The essays to be graded are not directly compared to the prescored essays with, for instance, the k-nearest neighbors method; instead, the prescored essays are used to determine the similarity threshold values for grade categories, as discussed below. 
Prescored essays can also be used to determine the optimal dimension for the reduced matrix in LSA, as discussed in Kakkonen et al. (2005).</Paragraph> <Paragraph position="2"> The texts to be analyzed are added to a word-by-context matrix (WCM), which records the number of occurrences of each unique word in each of the contexts (e.g. documents, paragraphs or sentences). In a WCM M, cell Mij contains the number of occurrences of word i in context j. As the first step in analyzing the essays and course materials, the lemma of each word form occurring in the texts must be found. We have so far applied AEA only to essays written in Finnish. Finnish is morphologically more complex than English, and word forms are built by adding suffixes to base forms. Because of that, base forms have to be used instead of inflected forms when building the WCM, especially if a relatively small corpus is utilized. Furthermore, several words can become synonyms when suffixes are added to them, which makes word sense disambiguation necessary. Hence, instead of just stripping suffixes, we apply a more sophisticated method, a morphological parser and disambiguator, namely the Constraint Grammar parser for Finnish (FINCG), to produce the lemma for each word (Lingsoft, 2005). In addition, the most commonly occurring words (stopwords) are not included in the matrix, and only the words that appear in at least two contexts are added to the WCM (Landauer et al., 1998). We also apply entropy-based term weighting in order to give higher values to words that are more important for the content and lower values to words with less importance.</Paragraph> <Paragraph position="3"> First, the comparison materials based on the relevant textbook passages or other course materials are converted into machine-readable form with the method described in the previous paragraph. The vector for each context in the comparison materials is marked with Yi. 
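The preprocessing described above can be illustrated with a minimal Python sketch. It is not the AEA implementation: it assumes the contexts have already been lemmatized (the FINCG step is not reproduced), and it assumes the common log-entropy scheme as the entropy-based weighting, since the exact formula is not given here. The names build_wcm and log_entropy_weight are hypothetical.

```python
import math
from collections import Counter


def build_wcm(contexts, stopwords=frozenset(), min_contexts=2):
    """Build a word-by-context matrix from lists of lemmas.

    Returns (vocabulary, matrix) where matrix[i][j] is the number of
    occurrences of word i in context j. Stopwords are dropped, and only
    words occurring in at least `min_contexts` contexts are kept.
    """
    counts = [Counter(w for w in ctx if w not in stopwords) for ctx in contexts]
    # Iterating a Counter yields each distinct word once per context.
    context_freq = Counter(w for c in counts for w in c)
    vocab = sorted(w for w, cf in context_freq.items() if cf >= min_contexts)
    matrix = [[c[w] for c in counts] for w in vocab]
    return vocab, matrix


def log_entropy_weight(matrix):
    """Apply log-entropy weighting (an assumed scheme, see the lead-in).

    Local weight: log(1 + tf). Global weight: 1 + sum_j p_ij log p_ij / log n,
    where p_ij = tf_ij / (total count of word i) and n is the number of contexts.
    """
    if not matrix:
        return []
    n = len(matrix[0])
    weighted = []
    for row in matrix:
        total = sum(row)
        entropy = sum((tf / total) * math.log(tf / total) for tf in row if tf > 0)
        global_w = 1.0 + entropy / math.log(n) if n > 1 else 1.0
        weighted.append([global_w * math.log(1.0 + tf) for tf in row])
    return weighted


# Usage with three tiny, already lemmatized contexts (illustrative data only).
contexts = [["essay", "grade", "model"], ["essay", "model", "topic"], ["grade", "essay"]]
vocab, wcm = build_wcm(contexts)
weighted_wcm = log_entropy_weight(wcm)
```

Under this weighting, a word spread evenly over all contexts receives a global weight of zero, which matches the stated intent of giving low values to words that carry little content.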
The WCM is used to create the model with LSA, PLSA or another information retrieval method. To compare the similarity of an essay to the course materials, a query vector Xj of the same form as the vectors in the WCM is constructed. The query vector Xj representing an essay is added, or folded in, to the model built from the WCM in a method-specific way discussed later. This folded-in query ~Xj is then compared to the model of each text passage ~Yi in the comparison material by using a similarity measure to determine the similarity value. We have used the cosine of the angle between ~Xj and ~Yi to measure the similarity of two documents. The similarity score for an essay is calculated as the sum of the similarities between the essay and each of the textbook passages.</Paragraph> <Paragraph position="4"> The document vectors of manually graded essays are compared to the textbook passages in order to determine the similarity scores between the essays and the course materials. Based on these measures, threshold values for the grade categories are defined as follows: the grade categories, g1,g2,...,gC, are associated with similarity value limits, l1,l2,...,lC+1, where C is the number of grades, lC+1 = ∞ and normally l1 = 0 or −∞. The other category limits li, 2 ≤ i ≤ C, are defined as weighted averages of the similarity scores for essays belonging to grade categories gi and gi−1. Other kinds of formulas can also be used to define the grade category limits.</Paragraph> <Paragraph position="5"> The grade for each essay to be graded is then determined by calculating the similarity score between the essay and the textbook passages and comparing the similarity score to the threshold values defined in the previous phase. The similarity score Si of an essay di is matched to the grade categories according to their limits in order to determine the correct grade category as follows: for each i, 1 ≤ i ≤ C, if li < Si ≤ li+1 then di ∈ 
gi and break.</Paragraph> </Section> <Section position="2" start_page="30" end_page="31" type="sub_section"> <SectionTitle> 2.2 Latent Semantic Analysis </SectionTitle> <Paragraph position="0"> Latent Semantic Analysis (LSA) (Landauer et al., 1998) is a corpus-based method used in information retrieval with vector space models. It provides a means of comparing the semantic similarity between the source and target texts. LSA has been successfully applied to automate giving grades and feedback on free-text responses in several systems, as discussed in Section 1. The basic assumption behind LSA is that there is a close relationship between the meaning of a text and the words in that text. The power of LSA lies in the fact that it is able to map essays with similar wordings closer to each other in the vector space. The LSA method is able to strengthen the similarity between two texts even when they do not contain common words. We briefly describe the technical details of the method.</Paragraph> <Paragraph position="1"> The essence of LSA is dimension reduction based on the singular value decomposition (SVD), an algebraic technique. SVD is a form of factor analysis, which reduces the dimensionality of the original WCM and thereby increases the dependency between contexts and words (Landauer et al., 1998).</Paragraph> <Paragraph position="2"> SVD is defined as X = T0 S0 D0^T, where X is the preprocessed WCM, and T0 and D0 are orthonormal matrices representing the words and the contexts. S0 is a diagonal matrix of singular values. In the dimension reduction, the k highest singular values in S0 are selected and the rest are ignored. With this operation, an approximation matrix ~X of the original matrix X is acquired. 
The aim of the dimension reduction is to reduce &quot;noise&quot; or unimportant details and to allow the underlying semantic structure to become evident (Deerwester et al., 1990).</Paragraph> <Paragraph position="3"> In information retrieval and essay grading, the queries or essays have to be folded in to the model in order to calculate the similarities between the documents in the model and the query. In LSA, the folding in can be achieved with a simple matrix multiplication: ~Xq = Xq^T T0 S0^-1, where Xq is the term vector constructed from the query document with preprocessing, and T0 and S0 are the matrices from the SVD of the model after dimension reduction. The resulting vector ~Xq is in the same format as the documents in the model.</Paragraph> <Paragraph position="4"> The features that make LSA suitable for automated grading of essays can be summarized as follows. First, the method focuses on the content of the essay, not on surface features or keyword-based content analysis. Second, LSA-based scoring can be performed with a relatively small number of human-graded essays. Other methods, such as PEG and e-rater, typically need several hundred essays to form an assignment-specific model (Shermis et al., 2001; Burstein and Marcu, 2000), whereas the LSA-based IEA system has sometimes been calibrated with as few as 20 essays, though it typically needs more (Hearst et al., 2000).</Paragraph> <Paragraph position="5"> Although LSA has been successfully applied in information retrieval and related fields, it has also received criticism (Hofmann, 2001; Blei et al., 2003).</Paragraph> <Paragraph position="6"> The objective function determining the optimal decomposition in LSA is the Frobenius norm. 
This corresponds to an implicit additive Gaussian noise assumption on the counts and may be inadequate.</Paragraph> <Paragraph position="7"> This seems to be acceptable with small document collections, but with large document collections it might have a negative effect. LSA does not define a properly normalized probability distribution and, even worse, the approximation matrix may contain negative entries, meaning that a document contains a negative number of certain words after the dimension reduction. Hence, it is impossible to treat LSA as a generative language model and, moreover, the use of different similarity measures is limited. Furthermore, there is no obvious interpretation of the directions in the latent semantic space. This may matter if feedback is also given. Choosing the number of dimensions in LSA is typically based on ad hoc heuristics. However, research has been done that aims to resolve the problem of dimension selection in LSA, especially in the essay grading domain (Kakkonen et al., 2005).</Paragraph> </Section> <Section position="3" start_page="31" end_page="32" type="sub_section"> <SectionTitle> 2.3 Probabilistic Latent Semantic Analysis </SectionTitle> <Paragraph position="0"> Probabilistic Latent Semantic Analysis (PLSA) (Hofmann, 2001) is based on a statistical model which has been called the aspect model. The aspect model is a latent variable model for co-occurrence data, which associates unobserved class variables zk, k ∈ {1,2,...,K} with each observation. In our setting, the observation is an occurrence of a word wj, j ∈ {1,2,...,M}, in a particular context di, i ∈ {1,2,...,N}. 
The probabilities related to this model are defined as follows: * P(di) denotes the probability that a word occurrence will be observed in a particular context di; * P(wj|zk) denotes the class-conditional probability of a specific word conditioned on the unobserved class variable zk; and * P(zk|di) denotes a context-specific probability distribution over the latent variable space.</Paragraph> <Paragraph position="1"> When using PLSA in essay grading or information retrieval, the first goal is to build the model. In other words, the probability mass functions are approximated with machine learning from the training data, in our case the comparison material consisting of assignment-specific texts.</Paragraph> <Paragraph position="2"> The Expectation Maximization (EM) algorithm can be used in the model building with the maximum likelihood formulation of the learning task (Dempster et al., 1977). The EM algorithm alternates between two steps: (i) an expectation (E) step, where posterior probabilities are computed for the latent variables based on the current estimates of the parameters, and (ii) a maximization (M) step, where parameters are updated based on the log-likelihood, which depends on the posterior probabilities computed in the E-step. The standard E-step is defined in equation (1).</Paragraph> <Paragraph position="4"> The M-step is formulated in equations (2) and (3) as derived by Hofmann (2001). These two steps are alternated until a termination condition is met, in this case, when the likelihood function has converged.</Paragraph> <Paragraph position="6"> Although the standard EM algorithm can lead to good results, it may also overfit the model to the training data and perform poorly on unseen data. Furthermore, the algorithm is iterative and converges slowly, which can increase the runtime considerably.</Paragraph> <Paragraph position="7"> Hence, Hofmann (2001) proposes another approach called Tempered EM (TEM), which is derived from the standard EM algorithm. 
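The standard E- and M-steps described above can be sketched in plain Python as follows. This is an illustrative sketch, not the implementation used in the experiments: it assumes the usual updates for the parametrization with P(wj|zk) and P(zk|di), a fixed number of iterations instead of a convergence test, and random initialization. The function names plsa_em and log_likelihood are hypothetical.

```python
import math
import random


def _normalized(values):
    """Scale non-negative values to sum to one (uniform if all are zero)."""
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def plsa_em(counts, num_topics, iterations=50, seed=0):
    """Fit the aspect model with standard EM.

    counts[i][j] is n(d_i, w_j). Returns (p_w_z, p_z_d), where
    p_w_z[k][j] = P(w_j | z_k) and p_z_d[i][k] = P(z_k | d_i).
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_w_z = [_normalized([rng.random() + 1e-3 for _ in range(n_words)])
             for _ in range(num_topics)]
    p_z_d = [_normalized([rng.random() + 1e-3 for _ in range(num_topics)])
             for _ in range(n_docs)]
    for _ in range(iterations):
        acc_w_z = [[0.0] * n_words for _ in range(num_topics)]
        acc_z_d = [[0.0] * num_topics for _ in range(n_docs)]
        for i in range(n_docs):
            for j in range(n_words):
                n_ij = counts[i][j]
                if n_ij == 0:
                    continue
                # E-step: posterior P(z_k | d_i, w_j), proportional to
                # P(w_j | z_k) * P(z_k | d_i).
                joint = [p_w_z[k][j] * p_z_d[i][k] for k in range(num_topics)]
                norm = sum(joint)
                for k in range(num_topics):
                    expected = n_ij * joint[k] / norm
                    acc_w_z[k][j] += expected
                    acc_z_d[i][k] += expected
        # M-step: re-estimate both distributions from the expected counts.
        p_w_z = [_normalized(row) for row in acc_w_z]
        p_z_d = [_normalized(row) for row in acc_z_d]
    return p_w_z, p_z_d


def log_likelihood(counts, p_w_z, p_z_d):
    """Sum of n(d_i, w_j) * log P(w_j | d_i) over the observed cells."""
    total = 0.0
    for i, row in enumerate(counts):
        for j, n_ij in enumerate(row):
            if n_ij > 0:
                p = sum(p_w_z[k][j] * p_z_d[i][k] for k in range(len(p_w_z)))
                total += n_ij * math.log(p)
    return total


# Usage on a tiny count matrix with two visible word clusters (made-up data).
counts = [[4, 3, 0, 0], [5, 2, 1, 0], [0, 0, 3, 4], [0, 1, 2, 5]]
p_w_z, p_z_d = plsa_em(counts, num_topics=2, iterations=30, seed=1)
```

Tracking log_likelihood across iterations gives a simple way to check the convergence condition mentioned above; on training data it should not decrease from one EM iteration to the next.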
In TEM, the M-step is the same as in EM, but a dampening parameter is introduced into the E-step as shown in equation (4). The parameter β dampens the posterior probabilities toward the uniform distribution when β < 1 and yields the standard E-step when β = 1.</Paragraph> <Paragraph position="9"> Hofmann (2001) defines the TEM algorithm as follows: 1. Set β := 1 and perform the standard EM with early stopping.</Paragraph> <Paragraph position="10"> 2. Set β := ηβ (with η < 1). 3. Repeat the E- and M-steps as long as the performance on hold-out data improves; otherwise go to step 2.</Paragraph> <Paragraph position="11"> 4. Stop the iteration when decreasing β does not improve performance on hold-out data. Early stopping means that the optimization is not run until the model converges; instead, the iteration is stopped as soon as the performance on hold-out data degrades. Hofmann (2001) proposes using the perplexity both to measure the generalization performance of the model and as the stopping condition for early stopping. The perplexity is defined as the log-averaged inverse probability on unseen data, calculated as in equation (5),</Paragraph> <Paragraph position="13"> where n'(di,wj) is the count on hold-out or training data.</Paragraph> <Paragraph position="14"> In PLSA, the folding in is done by using TEM as well. The only difference when folding in a new document or query q outside the model is that only the probabilities P(zk|q) are updated during the M-step, while the P(wj|zk) are kept as they are. The similarity between a document di in the model and a query q folded in to the model can be calculated as the cosine of the angle between the vectors containing the probability distributions (P(zk|q)), k = 1,...,K, and (P(zk|di)), k = 1,...,K (Hofmann, 2001).</Paragraph> <Paragraph position="15"> PLSA, unlike LSA, defines proper probability distributions for the documents and has its basis in statistics. 
It belongs to a framework called Latent Dirichlet Allocations (Girolami and Kabán, 2003; Blei et al., 2003), which gives a better grounding for this method.</Paragraph> </Section> <Section position="4" start_page="32" end_page="33" type="sub_section"> <SectionTitle> </SectionTitle> <Paragraph position="0"> For instance, several probabilistic similarity measures can be used. PLSA is interpretable with its generative model, latent classes and illustrations in N-dimensional space (Hofmann, 2001).</Paragraph> <Paragraph position="1"> The latent classes or topics can be used to determine which parts of the comparison materials the student has covered in the answer and which not.</Paragraph> <Paragraph position="2"> In empirical research conducted by Hofmann (2001), PLSA yielded equal or better results compared to LSA in the context of information retrieval. It was also shown that the accuracy of PLSA can increase when the number of latent variables is increased. Furthermore, the combination of several similarity scores (e.g. cosines of angles between two documents) from models with different numbers of latent variables also increases the overall accuracy. Therefore, the selection of the dimension is not as crucial as in LSA. The problem with PLSA is that the algorithm used to compute the model, EM or its variant, is probabilistic and can converge to a local maximum. However, according to Hofmann (2001), this is not a problem since the differences between separate runs are small. Flaws in the generative model and the overfitting problem have been discussed in Blei et al. 
(2003).</Paragraph> </Section> </Section> <Section position="4" start_page="33" end_page="34" type="metho"> <SectionTitle> 3 Experiment </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="33" end_page="33" type="sub_section"> <SectionTitle> 3.1 Procedure and Materials </SectionTitle> <Paragraph position="0"> To analyze the performance of LSA and PLSA in essay assessment, we performed an experiment using three essay sets collected from courses on education, marketing and software engineering. The information about the essay collections is shown in Table 1. Comparison materials were taken either from the course book or other course materials and selected by the lecturer of the course. Furthermore, the comparison materials used in each of these sets were divided with two methods, either into paragraphs or sentences. Thus, we ran the experiment with six different configurations of materials in total.</Paragraph> <Paragraph position="1"> We used our implementations of the LSA and PLSA methods as described in Section 2. With LSA, all the possible dimensions (i.e. from two to the number of passages in the comparison materials) were searched in order to find the dimension achieving the highest accuracy of scoring, measured as the correlation between the grades given by the system and the human assessor. There is no upper limit for the number of latent variables in PLSA models as there is for the dimensions in LSA. Thus, we applied the same range for the best dimension search to be fair in the comparison. Furthermore, a linear combination of similarity values from PLSA models (PLSA-C) with predefined numbers of latent variables K ∈ {16,32,48,64,80,96,112,128} was used just to analyze the proposed potential of the method as discussed in Section 2.3 and in (Hofmann, 2001). 
When building up all the PLSA models with TEM, we used 20 essays from the training set of the essay collections to determine the early stopping condition with perplexity of the model on unseen data as proposed by Hofmann (2001).</Paragraph> </Section> <Section position="2" start_page="33" end_page="34" type="sub_section"> <SectionTitle> 3.2 Results and Discussion </SectionTitle> <Paragraph position="0"> The results of the experiment for all the three methods, LSA, PLSA and PLSA-C are shown in Table 2.</Paragraph> <Paragraph position="1"> It contains the most accurate dimension (column dim.) measured by machine-human correlation in grading, the percentage of the same (same) and adjacent grades (adj.) compared to the human grader and the Spearman correlation (cor.) between the grades given by the human assessor and the system.</Paragraph> <Paragraph position="2"> The results indicate that LSA outperforms both methods using PLSA. This is opposite to the results obtained by Hofmann (2001) in information retrieval. We believe this is due to the size of the document collection used to build up the model. In the experiments of Hofmann (2001), it was much larger, 1000 to 3000 documents, while in our case the number of documents was between 25 and 150.</Paragraph> <Paragraph position="3"> However, the differences are quite small when using the comparison materials divided into sentences. Although all methods seem to be more accurate when the comparison materials are divided into sentences, PLSA based methods seem to gain more than LSA.</Paragraph> <Paragraph position="4"> In most cases, PLSA with the most accurate dimension and PLSA-C perform almost equally.</Paragraph> <Paragraph position="5"> This is also in contrast with the findings of Hofmann (2001) because in his experiments PLSA-C performed better than PLSA. This is probably also due to the small document sets used. 
Nevertheless, this means that finding the most accurate dimension is unnecessary; it is enough to combine the similarity values from several dimensions. In our case, it seems that a linear combination of the similarity values is not the best option because the similarity values between essays and comparison materials decrease when the number of latent variables increases. A topic for further study would be to analyze techniques to combine the similarity values in PLSA-C to obtain higher accuracy in essay grading. Furthermore, it seems that the best combination of dimensions in PLSA-C depends on the features of the document collection used (e.g. number of passages in comparison materials or number of essays). Another topic of further research is how the combination of dimensions can be optimized for each essay set by using the collection-specific features without the validation procedure proposed in Kakkonen et al. (2005).</Paragraph> <Paragraph position="6"> Currently, we have not implemented a version of LSA that combines scores from several models, but we will analyze the possibilities for that in future research. Nevertheless, LSA representations for different dimensions form a nested sequence because of the number of singular values taken to approximate the original matrix. This will make the model combination less effective with LSA. This is not true for statistical models, such as PLSA, because they can capture a larger variety of the possible decompositions and thus several models can actually complement each other (Hofmann, 2001).</Paragraph> </Section> </Section> <Section position="5" start_page="34" end_page="35" type="metho"> <SectionTitle> 4 Future Work and Conclusion </SectionTitle> <Paragraph position="0"> We have implemented a system to assess essays written in Finnish. In this paper, we report a new extension to the system for analyzing the essays with the PLSA method. We have compared LSA and PLSA as methods for essay grading. 
When our results are compared to the correlations between human and system grades reported in the literature, we have achieved promising results with all methods.</Paragraph> <Paragraph position="1"> LSA was slightly better than the PLSA-based methods. As future research, we are going to analyze whether there are better methods to combine the similarity scores from several models in the context of essay grading to increase the accuracy (Hofmann, 2001). Another interesting topic is to combine LSA and PLSA to complement each other.</Paragraph> <Paragraph position="2"> We used the cosine of the angle between the probability vectors as a measure of similarity in LSA and PLSA. Other methods have been proposed to determine the similarities between probability distributions produced by PLSA (Girolami and Kabán, 2003; Blei et al., 2003). The effects of using these techniques will be compared in future experiments.</Paragraph> <Paragraph position="3"> If the PLSA models with different numbers of latent variables are not highly dependent on each other, this would allow us to analyze the reliability of the grades given by the system. This is not possible with LSA-based methods as they are normally highly dependent on each other. However, further work is needed to examine the full potential of this idea.</Paragraph> <Paragraph position="4"> Our future aim is to develop a semi-automatic essay assessment system (Kakkonen et al., 2004).</Paragraph> <Paragraph position="5"> For determining the grades or giving feedback to the student, the system needs a method for comparing similarities between the texts. LSA and PLSA offer a feasible solution for the purpose. In order to achieve even more accurate grading, we can use some of the results and techniques developed for LSA and develop them further for both methods. 
We are currently working on an extension to our LSA model that uses standard validation methods for automatically reducing the irrelevant content information in LSA-based essay grading (Kakkonen et al., 2005). In addition, we plan to continue the work with PLSA, since it, being a probabilistic model, introduces new possibilities, for instance, in similarity comparison and feedback giving.</Paragraph> </Section> </Paper>