<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-1065">
  <Title>Reading Level Assessment Using Support Vector Machines and Statistical Language Models</Title>
  <Section position="3" start_page="523" end_page="523" type="metho">
    <SectionTitle>
2 Reading Level Assessment
</SectionTitle>
    <Paragraph position="0"> This section highlights examples and features of some commonly used measures of reading level and discusses current research on the topic of reading level assessment using NLP techniques.</Paragraph>
    <Paragraph position="1"> Many traditional methods of reading level assessment focus on simple approximations of syntactic complexity such as sentence length. The widely-used Flesch-Kincaid Grade Level index is based on the average number of syllables per word and the average sentence length in a passage of text (Kincaid et al., 1975) (as cited in (Collins-Thompson and Callan, 2004)). Similarly, the Gunning Fog index is based on the average number of words per sentence and the percentage of words with three or more syllables (Gunning, 1952). These methods are quick and easy to calculate but have drawbacks: sentence length is not an accurate measure of syntactic complexity, and syllable count does not necessarily indicate the dif culty of a word. Additionally, a student may be familiar with a few complex words (e.g. dinosaur names) but unable to understand complex syntactic constructions.</Paragraph>
    <Paragraph position="2"> Other measures of readability focus on semantics, which is usually approximated by word frequency with respect to a reference list or corpus.</Paragraph>
    <Paragraph position="3"> The Dale-Chall formula uses a combination of average sentence length and percentage of words not on a list of 3000 easy words (Chall and Dale, 1995). The Lexile framework combines measures of semantics, represented by word frequency counts, and syntax, represented by sentence length (Stenner, 1996). These measures are inadequate for our task; in many cases, teachers want materials with more dif cult, topic-speci c words but simple structure.</Paragraph>
    <Paragraph position="4"> Measures of reading level based on word lists do not capture this information.</Paragraph>
    <Paragraph position="5"> In addition to the traditional reading level metrics, researchers at Carnegie Mellon University have applied probabilistic language modeling techniques to this task. Si and Callan (2001) conducted preliminary work to classify science web pages using uni-gram models. More recently, Collins-Thompson and Callan manually collected a corpus of web pages ranked by grade level and observed that vocabulary words are not distributed evenly across grade levels. They developed a smoothed unigram classi er to better capture the variance in word usage across grade levels (Collins-Thompson and Callan, 2004). On web text, their classi er outperformed several other measures of semantic dif culty: the fraction of unknown words in the text, the number of distinct types per 100 token passage, the mean log frequency of the text relative to a large corpus, and the Flesch-Kincaid measure. The traditional measures performed better on some commercial corpora, but these corpora were calibrated using similar measures, so this is not a fair comparison. More importantly, the smoothed unigram measure worked better on the web corpus, especially on short passages. The smoothed unigram classi er is also more generalizable, since it can be trained on any collection of data. Traditional measures such as Dale-Chall and Lexile are based on static word lists.</Paragraph>
    <Paragraph position="6"> Although the smoothed unigram classi er outperforms other vocabulary-based semantic measures, it does not capture syntactic information. We believe that higher order n-gram models or class n-gram models can achieve better performance by capturing both semantic and syntactic information. This is particularly important for the tasks we are interested in, when the vocabulary (i.e. topic) and grade level are not necessarily well-matched.</Paragraph>
  </Section>
  <Section position="4" start_page="523" end_page="524" type="metho">
    <SectionTitle>
3 Corpora
</SectionTitle>
    <Paragraph position="0"> Our work is currently focused on a corpus obtained from Weekly Reader, an educational newspaper with versions targeted at different grade levels (Weekly Reader, 2004). These data include a variety of labeled non- ction topics, including science, history, and current events. Our corpus consists of articles from the second, third, fourth, and fth grade edi- null tions of the newspaper. We design classi ers to distinguish each of these four categories. This corpus contains just under 2400 articles, distributed as shown in Table 1.</Paragraph>
    <Paragraph position="1"> Additionally, we have two corpora consisting of articles for adults and corresponding simpli ed versions for children or other language learners. Barzilay and Elhadad (2003) have allowed us to use their corpus from Encyclopedia Britannica, which contains articles from the full version of the encyclopedia and corresponding articles from Britannica Elementary, a new version targeted at children. The Western/Paci c Literacy Network's (2004) web site has an archive of CNN news stories and abridged versions which we have also received permission to use. Although these corpora do not provide an explicit grade-level ranking for each article, broad categories are distinguished. We use these data as a supplement to the Weekly Reader corpus for learning models to distinguish broad reading level classes than can serve to provide features for more detailed classi cation. Table 2 shows the size of the supplemental corpora.</Paragraph>
  </Section>
  <Section position="5" start_page="524" end_page="526" type="metho">
    <SectionTitle>
4 Approach
</SectionTitle>
    <Paragraph position="0"> Existing reading level measures are inadequate due to their reliance on vocabulary lists and/or a super cial representation of syntax. Our approach uses n-gram language models as a low-cost automatic approximation of both syntactic and semantic analysis. Statistical language models (LMs) are used successfully in this way in other areas of NLP such as speech recognition and machine translation. We also use a standard statistical parser (Charniak, 2000) to provide syntactic analysis.</Paragraph>
    <Paragraph position="1"> In practice, a teacher is likely to be looking for texts at a particular level rather than classifying a group of texts into a variety of categories. Thus we construct one classi er per category which decides whether a document belongs in that category or not, rather than constructing a classi er which ranks documents into different categories relative to each other.</Paragraph>
    <Section position="1" start_page="524" end_page="525" type="sub_section">
      <SectionTitle>
4.1 Statistical Language Models
</SectionTitle>
      <Paragraph position="0"> Statistical LMs predict the probability that a particular word sequence will occur. The most commonly used statistical language model is the n-gram model, which assumes that the word sequence is an (n[?]1)th order Markov process. For example, for the common trigram model where n = 3, the probability of sequence w is:</Paragraph>
      <Paragraph position="2"> The parameters of the model are estimated using a maximum likelihood estimate based on the observed frequency in a training corpus and smoothed using modi ed Kneser-Ney smoothing (Chen and Goodman, 1999). We used the SRI Language Modeling Toolkit (Stolcke, 2002) for language model training.</Paragraph>
      <Paragraph position="3"> Our rst set of classi ers consists of one n-gram language model per class c in the set of possible classes C. For each text document t, we can calculate the likelihood ratio between the probability given by the model for class c and the probabilities given by the other models for the other classes:</Paragraph>
      <Paragraph position="5"> (2) where we assume uniform prior probabilities P(c). The resulting value can be compared to an empirically chosen threshold to determine if the document is in class c or not. For each class c, a language model is estimated from a corpus of training texts.  In addition to using the likelihood ratio for classication, we can use scores from language models as features in another classi er (e.g. an SVM). For example, perplexity (PP) is an information-theoretic measure often used to assess language models:</Paragraph>
      <Paragraph position="7"> where H(t|c) is the entropy relative to class c of a length m word sequence t = w1, ..., wm, de ned as</Paragraph>
      <Paragraph position="9"> Low perplexity indicates a better match between the test data and the model, corresponding to a higher probability P(t|c). Perplexity scores are used as features in the SVM model described in Section 4.3.</Paragraph>
      <Paragraph position="10"> The likelihood ratio described above could also be used as a feature, but we achieved better results using perplexity.</Paragraph>
    </Section>
    <Section position="2" start_page="525" end_page="525" type="sub_section">
      <SectionTitle>
4.2 Feature Selection
</SectionTitle>
      <Paragraph position="0"> Feature selection is a common part of classi er design for many classi cation problems; however, there are mixed results in the literature on feature selection for text classi cation tasks. In Collins-Thompson and Callan's work (2004) on readability assessment, LM smoothing techniques are more effective than other forms of explicit feature selection. However, feature selection proves to be important in other text classi cation work, e.g. Lee and Myaeng's (2002) genre and subject detection work and Boulis and Ostendorf's (2005) work on feature selection for topic classi cation.</Paragraph>
      <Paragraph position="1"> For our LM classi ers, we followed Boulis and Ostendorf's (2005) approach for feature selection and ranked words by their ability to discriminate between classes. Given P(c|w), the probability of class c given word w, estimated empirically from the training set, we sorted words based on their information gain (IG). Information gain measures the difference in entropy when w is and is not included as a feature.</Paragraph>
      <Paragraph position="3"> The most discriminative words are selected as features by plotting the sorted IG values and keeping only those words below the knee in the curve, as determined by manual inspection of the graph. In an early experiment, we replaced all remaining words with a single unknown tag. This did not result in an effective classi er, so in later experiments the remaining words were replaced with a small set of general tags. Motivated by our goal of representing syntax, we used part-of-speech (POS) tags as labeled by a maximum entropy tagger (Ratnaparkhi, 1996). These tags allow the model to represent patterns in the text at a higher level than that of individual words, using sequences of POS tags to capture rough syntactic information. The resulting vocabulary consisted of 276 words and 56 POS tags.</Paragraph>
    </Section>
    <Section position="3" start_page="525" end_page="526" type="sub_section">
      <SectionTitle>
4.3 Support Vector Machines
</SectionTitle>
      <Paragraph position="0"> Support vector machines (SVMs) are a machine learning technique used in a variety of text classication problems. SVMs are based on the principle of structural risk minimization. Viewing the data as points in a high-dimensional feature space, the goal is to t a hyperplane between the positive and negative examples so as to maximize the distance between the data points and the plane. SVMs were introduced by Vapnik (1995) and were popularized in the area of text classi cation by Joachims (1998a).</Paragraph>
      <Paragraph position="1"> The unit of classi cation in this work is a single article. Our SVM classi ers for reading level use the following features:  (grade 2) 2. For each article, we calculated the percentage of a) all word instances (tokens) and b) all unique words (types) not on these lists, resulting in three token OOV rate features and three type OOV rate features per article.</Paragraph>
      <Paragraph position="2"> The parse features are generated using the Charniak parser (Charniak, 2000) trained on the standard Wall Street Journal Treebank corpus. We chose to use this standard data set as we do not have any domain-speci c treebank data for training a parser.</Paragraph>
      <Paragraph position="3"> Although clearly there is a difference between news text for adults and news articles intended for children, inspection of some of the resulting parses showed good accuracy.</Paragraph>
      <Paragraph position="4"> Ideally, the language model scores would be for LMs from domain-speci c training data (i.e. more Weekly Reader data.) However, our corpus is limited and preliminary experiments in which the training data was split for LM and SVM training were unsuccessful due to the small size of the resulting data sets. Thus we made use of the Britannica and CNN articles to train models of three n-gram orders on child text and adult text. This resulted in 12 LM perplexity features per article based on trigram, bigram and unigram LMs trained on Britannica (adult), Britannica Elementary, CNN (adult) and CNN abridged text.</Paragraph>
      <Paragraph position="5"> For training SVMs, we used the SVMlight toolkit developed by Joachims (1998b). Using development data, we selected the radial basis function kernel and tuned parameters using cross validation and grid search as described in (Hsu et al., 2003).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>