<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-1024">
  <Title>Evaluating Multiple Aspects of Coherence in Student Essays</Title>
  <Section position="6" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> In each of the experiments below, the results are reported for the entire set of 989 essays annotated for this project. We performed ten-fold cross-validation, training our SVM classifier on 910 of the data at a time, and testing on the remaining 110. We report the results on the cross-validation set for all runs combined.</Paragraph>
    <Paragraph position="1"> For each dimension, we also report the performance of a simple baseline measure, which assumes that all of our essay coherence criteria are satisfied. That is, our baseline assigns category 1 (high relevance) to every sentence, on every dimension.</Paragraph>
    <Paragraph position="2"> These essays were written in response to six different prompts, and had an average (human-assigned) score of  sion, broken down by essay score point 4.0 on a six-point scale. Therefore, a priori, it seems possible that we could build a better baseline model by conditioning its predictions on the overall score of the essay (assigning 1's to sentences from better-scoring essays, and 0's to sentences from lower-scoring essays). However, the coherence requirements of each of our dimensions are usually met even in the lowest-scoring essays, as shown in Table 2, which lists the percentage of sentences in different essay score ranges which our human annotators assigned category 1. Looking at the highest and lowest score points on our six-point scale, it is clear that higher-scoring essays do tend to have fewer problems with coherence, but this effect is not overwhelming. (The largest gap between the highest- and lowest-scoring essays is on DimERR, which deals with errors in grammar, usage, and mechanics.)</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 DimP
</SectionTitle>
      <Paragraph position="0"> According to the protocol, there are four discourse elements for which DimP , the degree of relatedness to the essay prompt, is relevant: Background, Conclusion, Main Point, and Thesis. The Supporting Idea category of sentence is not required to be related to the prompt, because it may express an elaboration of one of the main points of the essay, and has a more tenuous and mediated logical connection to the essay prompt text.</Paragraph>
      <Paragraph position="1"> The features which we provide to the SVM for predicting a sentence's relatedness to the prompt are:  1. The RI similarity score of the target sentence with the entire essay prompt, 2. The maximum RI similarity score of the target sentence with any sentence in the essay prompt, 3. The RI similarity score of the target sentence with the required task sentence (a designated portion of the prompt text which contains an explicit directive to the student to write about a specific topic), 4. The RI similarity score of the target sentence with the entire thesis of the essay, 5. The maximum RI similarity score of the target sentence with any sentence in the thesis, 6. The maximum RI similarity score of the target sentence with any sentence in the preceding discourse chunk, 7. The number of sentences in the current chunk, 8. The offset of the target sentence (sentence number) from the beginning of the current discourse chunk, 9. The number of sentences in the current chunk whose similarity with the prompt is greater than .2, 10. The number of sentences in the current chunk whose similarity with the required task sentence is greater than .2, 11. The number of sentences in the current chunk whose similarity with the essay thesis is greater than .2, 12. The number of sentences in the current chunk whose similarity with the prompt is greater than .4, 13. The number of sentences in the current chunk whose similarity with the required task sentence is greater than .4, 14. The number of sentences in the current chunk whose similarity with the essay thesis is greater than .4, 15. The length of the target sentence in words, 16. A Boolean feature indicating whether the target sentence contains a transition word, such as &amp;quot;however&amp;quot;, or &amp;quot;although&amp;quot;, 17. A Boolean feature indicating whether the target sen- null tence contains an anaphoric element, and 18. The category of the current chunk. (This is encoded as five Boolean features: one bit for each of &amp;quot;Background&amp;quot;, &amp;quot;Conclusion&amp;quot;, &amp;quot;Main Point&amp;quot;, &amp;quot;Supporting Idea&amp;quot;, and &amp;quot;Thesis&amp;quot;.) In calculating features 2, 5, and 6, we use the maximum similarity score of the sentence with any other sentence in the relevant discourse segment, rather than simply using the similarity score of the sentence with the entire text chunk. We add this feature based on the intuition that for a sentence to be relevant to another discourse segment, it need only be connected to some part of that segment.</Paragraph>
      <Paragraph position="2"> It is perhaps surprising that we include features which measure the degree of similarity between the sentence and the thesis, since we are trying to predict its relatedness to the prompt, rather than the thesis. However, there are two reasons we believe this is fruitful. First, since we are dealing with a relatively small amount of text, comparing a single sentence to a short essay prompt, looking at the thesis as well helps to overcome data sparsity issues. Second, it may be that the relevance of the current sentence to the prompt is mediated by the student's thesis statement. For example, the prompt may ask the student to take a position on some topic. They may state this position in the thesis, and provide an example to support it as one of their Main Points. In such a case, the example would be more clearly linked to the Thesis, but this would suffice for it to be related to the prompt.</Paragraph>
      <Paragraph position="3"> Considering the similarity scores of sentences in the current discourse segment is also, in part, an attempt to overcome data sparsity issues, but is also motivated by the idea that it may be an entire discourse segment which can properly be said to be (ir)relevant to the essay prompt. The sentence length and transition word features do not directly reflect the relatedness of a sentence to the prompt, but they are likely to be useful correlates.</Paragraph>
      <Paragraph position="4"> Finally, the feature (#17) indicating the presence of a pronoun is to help the system deal with cases in which a sentence contains very few content words, but is still linked to other material in the essay by means of anaphoric elements, such as &amp;quot;This is shown by my argument.&amp;quot; In such as case, the sentence would normally get a low similarity score with the prompt (and other parts of the essay), but the information that it contains a pronoun might still allow the system to classify it correctly.</Paragraph>
      <Paragraph position="5"> Table 3 shows results using the baseline algorithm to classify sentences according to their relatedness to the prompt. Table 4 presents the results using the SVM classifier. We provide precision, recall, and f-measure for the assignment of the labels 1 and 0, and an overall accuracy measure in the far right column. (The accuracy measure is the value for precision and recall when 1 and 0 ranks are collapsed. Precision and recall will be the same, since the number of labels assigned by the model is equal to the number of labels in the target assignment.) The SVM model outperforms the baseline on every subcategory, with the largest gains on Background sentences, most of which are, in fact, unrelated to the prompt according to our human judges. This low baseline result on Background sentences could indicate that many students have a problem with providing unnecessary and irrelevant prefaces to the important points in their essays. Note that the trained SVM has around .9 recall on the class of sentences which according to our human annotators have high relevance to the prompt. This means that our system is less likely to incorrectly assign a low rank to a sentence that is high. So, the system will tend to err on the side of the student, which is a preferable trade-off. In part, this is due to the nature of the semantic similarity measure we are using, which does not take word order into account. While RI does allow us to capture a richer meaning component than simply matching words which co-occur in the target sentence and prompt, it still does not encompass all that goes into determining whether a sentence &amp;quot;relates&amp;quot; to another chunk of text. Students often write something which bears a loose topical connection with the essay prompt, but does not directly address the question. This sort of problem is hard to address with a tool such as LSA or RI; the vocabulary of the sentence on its own will not provide a clue to the sentence's failure to address the task.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 DimT
</SectionTitle>
      <Paragraph position="0"> The annotation protocol states that these four discourse elements come into play for DimT : Background, Conclusion, Main Point, and Supporting Idea. Because this dimension indicates the degree of relatedness to the thesis of the essay (and also other discourse segments in the case of Supporting Idea and Conclusion sentences; see Section 2.1.1 above), we do not consider thesis sentences with regard to this aspect of coherence.</Paragraph>
      <Paragraph position="1"> The features which we provide to the SVM for predicting whether or not a given sentence is related to the thesis are almost the same ones used for DimP . The only difference is that we omit features #12 and #13 in our model of DimT . These are the features which evaluate how many sentences in the current chunk have a similarity score with the prompt and required task sentence greater than 0.4. While DimP is to some degree sensitive to the similarity of a sentence to the thesis, and DimT can likewise benefit from the information about a sentence's similarity to the prompt, it seems that the latter link is less important, so a single cutoff suffices for this model.</Paragraph>
      <Paragraph position="2"> Tables 5-6 present the results for our SVM model and for a baseline which assigns all sentences &amp;quot;high&amp;quot; relevance. The improvements on DimT are smaller than the ones reported for DimP , but we still record an overall gain of four percentage points in accuracy. Only on conclusion sentences were we unable to produce an improvement over the baseline; we need to investigate this further. Again, the system achieves high recall on sentences with high relatedness. It outperforms the baseline by correctly identifying a modest percentage of the sentences labeled as having low relatedness with the thesis.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 DimS
</SectionTitle>
      <Paragraph position="0"> DimS, which concerns whether the target sentence relates to another sentence within the same discourse segment, seems another good candidate for applying our semantic similarity score to the task of establishing coherence. At present, however we have not made substantial progress on this task. The baselines for DimS are substantially higher than those for dimensions DimP and DimT -- 98.1% of all sentences in our data were annotated as &amp;quot;highly related&amp;quot; with respect to this dimension. This indicates that it is relatively rare to find a sentence which is not related to anything in the same discourse segment. This makes our task, to characterize those sentences which are not related to the discourse segment, much more difficult, since there are so few examples of sentences with low-ranking coherence.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 DimERR
</SectionTitle>
      <Paragraph position="0"> DimERR is clearly a different kind of problem. Here, we are looking for clarity of expression, or coherence within a sentence. We base this solely on technical correctness.</Paragraph>
      <Paragraph position="1"> We are able to automatically assign high and low ranks to essay sentences using a set of rules based on the number of grammar, usage and mechanics errors. The rules used for DimERR are as follows: a) assign a low label if the sentence is a fragment, if the sentence contains 2 or more grammar, usage, and mechanics errors, or if the sentence is a run-on, b) assign a high label if no criteria in (a) apply. Criterion's discourse analysis system also provides an essay score with e-rater(r), and qualitative feedback about grammar, usage, mechanics, and style (Leacock  and Chodorow, 2000; Burstein et al., 2003a). We can easily use Criterion's outputs about grammar, usage, and mechanics errors to assign high and low ranks to essay sentences, using the rules described in the previous section. null The performance of the module that does the DimERR assignments is in Table 7. We used half of the 292 essays from the training phase of annotation for development, and the remaining data from the training and post-training phases of annotation for cross-validation. Results are reported for the cross-validation set. Text labeled as titles, or opening or closing salutations, are not included in the results. The baselines were computed by assigning all sentences a high rank label. The baseline is high; however, the algorithm outperforms the baseline.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>