<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1065"> <Title>Reading Level Assessment Using Support Vector Machines and Statistical Language Models</Title> <Section position="6" start_page="526" end_page="528" type="evalu"> <SectionTitle> 5 Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="526" end_page="527" type="sub_section"> <SectionTitle> 5.1 Test Data and Evaluation Criteria </SectionTitle> <Paragraph position="0"> We divide the Weekly Reader corpus described in Section 3 into separate training, development, and test sets. The number of articles in each set is shown in Table 3. The development data is used as a test set for comparing classi ers, tuning parameters, etc, and the results presented in this section are based on the test set.</Paragraph> <Paragraph position="1"> We present results in three different formats. For corpus as divided into training, development and test sets. The dev and test sets are the same size and each consist of approximately 5% of the data for each grade level.</Paragraph> <Paragraph position="2"> measures. For comparison to other methods, e.g. Flesch-Kincaid and Lexile, which are not binary classi ers, we consider the percentage of articles which are misclassi ed by more than one grade level.</Paragraph> <Paragraph position="3"> Detection Error Tradeoff curves show the tradeoff between misses and false alarms for different threshold values for the classi ers. Misses are positive examples of a class that are misclassi ed as negative examples; false alarms are negative examples misclassi ed as positive. DET curves have been used in other detection tasks in language processing, e.g. Martin et al. (1997). We use these curves to visualize the tradeoff between the two types of errors, and select the minimum cost operating point in order to get a threshold for precision and recall calculations. The minimum cost operating point depends on the relative costs of misses and false alarms; it is conceivable that one type of error might be more serious than the other. After consultation with teachers (future users of our system), we concluded that there are pros and cons to each side, so for the purpose of this analysis we weighted the two types of errors equally. In this work, the minimum cost operating point is selected by averaging the percentages of misses and false alarms at each point and choosing the point with the lowest average. Unless otherwise noted, errors reported are associated with these actual operating points, which may not lie on the convex hull of the DET curve.</Paragraph> <Paragraph position="4"> Precision and recall are often used to assess information retrieval systems, and our task is similar. Precision indicates the percentage of the retrieved documents that are relevant, in this case the percentage of detected documents that match the target grade level. Recall indicates the percentage of the total number of relevant documents in the data set that are retrieved, in this case the percentage of the total number of documents from the target level that are detected.</Paragraph> </Section> <Section position="2" start_page="527" end_page="527" type="sub_section"> <SectionTitle> 5.2 Language Model Classi er </SectionTitle> <Paragraph position="0"> based classi ers. 
<Section position="2" start_page="527" end_page="527" type="sub_section"> <SectionTitle> 5.2 Language Model Classifier </SectionTitle> <Paragraph position="0"> Figure 1 shows DET curves for the trigram language model based classifiers. The minimum cost error rates for these classifiers, indicated by large dots in the plot, are in the range of 33-43%, with only one over 40%.</Paragraph> <Paragraph position="1"> The curves for bigram and unigram models have similar shapes, but the trigram models outperform the lower-order models. Error rates for the bigram models range from 37-45%, and the unigram models have error rates in the 39-49% range, with all but one over 40%. Although our training corpus is small, the feature selection described in Section 4.2 allows us to use these higher-order trigram models.</Paragraph> </Section> <Section position="3" start_page="527" end_page="528" type="sub_section"> <SectionTitle> 5.3 Support Vector Machine Classifier </SectionTitle> <Paragraph position="0"> By combining language model scores with other features in an SVM framework, we achieve our best results. Figures 2 and 3 show DET curves for this set of classifiers on the development set and test set, respectively. The grade 2 and 5 classifiers have the best performance, probably because grades 3 and 4 must be distinguished from other classes at both higher and lower levels. Using threshold values selected based on minimum cost on the development set, indicated by large dots on the plots, we calculated precision and recall on the test set. Results are presented in Table 4. The grade 3 classifier has high recall but relatively low precision; the grade 4 classifier does better on precision and reasonably well on recall. Since the minimum cost operating points do not correspond to the equal error rate (i.e., an equal percentage of misses and false alarms), there is variation in the precision-recall tradeoff for the different grade level classifiers. For example, for class 3, the operating point corresponds to a high probability of false alarms and a lower probability of misses, which results in low precision and high recall. For operating points chosen on the convex hull of the DET curves, the equal error rate ranges from 12-25% for the different grade levels.</Paragraph> <Paragraph position="1"> We investigated the contribution of individual features to the overall performance of the SVM classifier and found that no features stood out as most important; performance was degraded when any particular feature was removed.</Paragraph> </Section> <Section position="4" start_page="528" end_page="528" type="sub_section"> <SectionTitle> 5.4 Comparison </SectionTitle> <Paragraph position="0"> We also compared error rates for the best performing SVM classifier with two traditional reading level measures, Flesch-Kincaid and Lexile. The Flesch-Kincaid Grade Level index is a commonly used measure of reading level based on the average number of syllables per word and the average sentence length. The Flesch-Kincaid score for a document is intended to correspond directly with its grade level.</Paragraph> <Paragraph position="1"> We chose the Lexile measure as an example of a reading level classifier based on word lists (other classifiers, such as Dale-Chall, do not have automatic software available). Lexile scores do not correlate directly to numeric grade levels; however, a mapping of ranges of Lexile scores to their corresponding grade levels is available on the Lexile web site (Lexile, 2005).</Paragraph> <Paragraph position="2"> For each of these three classifiers, Table 5 shows the percentage of articles which are misclassified by more than one grade level.</Paragraph>
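For reference, the following is a minimal sketch of the standard Flesch-Kincaid Grade Level formula mentioned above (0.39 x words-per-sentence + 11.8 x syllables-per-word - 15.59). The vowel-group syllable counter is a rough approximation, and this sketch is not the implementation used for the comparison in Table 5.

```python
import re

def count_syllables(word):
    """Crude syllable estimate: count runs of vowels, with a floor of one."""
    return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)

def flesch_kincaid_grade(text):
    """Standard Flesch-Kincaid Grade Level from average sentence length
    and average syllables per word."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / max(len(sentences), 1)
    syllables_per_word = syllables / max(len(words), 1)
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
```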
<Paragraph position="3"> Flesch-Kincaid performs poorly, as expected, since its only features are sentence length and average syllable count. Although this index is commonly used, perhaps due to its simplicity, it is not accurate enough for the intended application. Our SVM classifier also outperforms the Lexile metric. Lexile is a more general measure while our classifier is trained on this particular domain, so the better performance of our model is not entirely surprising. Importantly, however, our classifier is easily tuned to any corpus of interest.</Paragraph> <Paragraph position="4"> To test our classifier on data outside the Weekly Reader corpus, we downloaded 10 randomly selected newspaper articles from the Kidspost edition of The Washington Post (2005). Kidspost is intended for grades 3-8. We found that our SVM classifier, trained on the Weekly Reader corpus, classified four of these articles as grade 4 and seven articles as grade 5 (with one overlap with grade 4). These results indicate that our classifier can generalize to other data sets. Since there was no training data corresponding to higher reading levels, the best performance we can expect for adult-level newspaper articles is for our classifiers to mark them as the highest grade level, which is indeed what happened for 10 randomly chosen articles from the standard edition of The Washington Post.</Paragraph> </Section> </Section> </Paper>