<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3217">
  <Title>Automatic Analysis of Plot for Story Rewriting</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Plot Ratings
</SectionTitle>
    <Paragraph position="0"> The stories were rated for plot by three different raters. A story-teller (Rater B) ranked all of the stories. Two others (Rater A, a teacher, and Rater C) ranked the stories as well, although Rater A ranked only half. The following scale, devised by a teacher with over forty years of experience, was used.</Paragraph>
    <Paragraph position="1">  1. Excellent: An excellent story shows that the  reader understands the &amp;quot;point&amp;quot; of the story and should demonstrate some deep understanding of the plot. The pupil should be able to retrieve all the important links and, not all the details, but the right details.</Paragraph>
    <Paragraph position="2">  2. Good: A good story shows that the pupil was listening to the story, and can recall the main 1The exemplar story used in our corpus was &amp;quot;Nils' Adventure,&amp;quot; a story from &amp;quot;The Wonderful Adventures of Nils&amp;quot; (Lagerloff, 1907).</Paragraph>
    <Paragraph position="3">  events and links in the plot. However, the pupil shows no deeper understanding of the plot, which can often be detected by the pupil leaving out an important link or emphasizing the wrong details.</Paragraph>
    <Paragraph position="4">  3. Fair: A fair story shows that the pupil is missing more than one link or chunk of the story, and not only lacks an understanding of the &amp;quot;point&amp;quot; but also lacks recall of vital parts of the story. A fair story does not really flow. 4. Poor: A poor story has definite problems with  recall of events, and is missing substantial amount of the plot. Characters will be misidentified and events confused. Often the child writes on the wrong subject or starts off reciting only the beginning of the story.</Paragraph>
    <Paragraph position="5"> Rater B and Rater A had an agreement of 39% while Rater B and Rater C had an agreement of 77%. However, these numbers are misleading as the rating scale is ordinal and almost all the disagreements were the result of grading a story either one rank better or worse. In particular Rater A usually marked incomplete stories as poor while the other raters assigned partial credit. To evaluate the reliability of the grades both Cronbach's and Kendall's b were used, since these statistics take into account ordinal scales and inter-rater reliability. Between Rater A and B there was a Cronbach's statistic of .86 and a Kendall's b statistic of .72. Between Rater B and C there was a Cronbach's statistic of .93 and Kendall's b statistic of .82. These statistics show our rating scheme to be fairly reliable. As the most qualified expert to rate all the stories, Rater B's ratings were used as the gold standard. The distribution of plot ratings are given in Table 1.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 A Minimal Event Calculus
</SectionTitle>
    <Paragraph position="0"> The most similar discourse analysis program to the one needed by StoryStation is the essay-grading component of &amp;quot;Criterion&amp;quot; by ETS technologies (Burstein et al., 2003), which is designed to annotate parts of an essay according to categories such as &amp;quot;Thesis, &amp;quot;Main Points,&amp;quot; &amp;quot;Support,&amp;quot; and &amp;quot;Conclusion.&amp;quot; Burstein et. al. (2003) uses Rhetorical Structure Theory to parse the text into discourse relations based on satellites and nuclei connected by rhetorical relations. Moore and Pollack (1992) note that Rhetorical Structure Theory conflates the informational (the information being conveyed) and intentional (the effects on the reader's beliefs or attitudes) levels of discourse. Narratives are primarily informational, and so tend to degenerate to long sequences of elaboration or sequence relations. Since in the story rewriting task the students are attempting to convey information about the narrative, unlike the primarily persuasive task of an essay, our system focuses on the informational level as embodied by a simplified event calculus. Another tutoring system similar to ours is the WHY physics tutoring system (Rose et al., 2002).</Paragraph>
    <Paragraph position="1"> We formulate only three categories to describe stories: events, event names, and entities. This formulation keeps the categories from being arbitrary or exploding in number. Entities are both animate characters, such as &amp;quot;elves&amp;quot; and &amp;quot;storks,&amp;quot; and inanimate objects like &amp;quot;sand&amp;quot; and &amp;quot;weather.&amp;quot; Nouns are the most common type of entities. Events are composed of the relationships among entities, such as &amp;quot;the boy becomes an elf,&amp;quot; which is composed of a &amp;quot;boy&amp;quot; and &amp;quot;elf&amp;quot; interacting via &amp;quot;becoming,&amp;quot; which we call the event name. This is because the use of such verbs is an indicator of the presence of an event in the story. In this manner events are relationships labeled with an event name, and entities are arguments to these relationships as in propositional logic. Together these can form events such as become(boy,elf), and this formulation maps partially onto Shanahan's event calculus which has been used in other story-understanding models (Mueller, 2003). The key difference between an event calculus and a collection of propositions is that time is explicitly represented in the event calculus.</Paragraph>
    <Paragraph position="2"> Each story consists of a group of events that are present in the story, e1:::eh. Each event consists of an event name, a time variable t, and a set of entities arranged in an ordered set n1:::na. An event must contain one and only one event name. The event names are usually verbs, while the entities tend to be, but are not exclusively, nouns. Time is made explicit through a variable t. Normally, the Shanahan event calculus has a series of predicates to deal with relations of achievements, accomplishments, and other types of temporal relations (Shanahan, 1997), however our calculus does not use these since it is difficult to extract these from ungrammatical raw text automatically. A story's temporal order is a partial ordering of events as denoted by their time variable t. When incorporating a set of entities into an event, a superscript is used to keep the entities distinct, as n13 is entity 1 in event 3. An entity may appear in multiple events, such as entity 1 appearing in event 3 (n13) and in event 5 (n15). The plot of a story can then be considered an event structure of the following form if it has h events:</Paragraph>
    <Paragraph position="4"> Where time t1 t2 :::th. An example from a rewritten story is &amp;quot;Nils found a coin and he walked round a sandy beach. He talked to the stork. Asked a question.&amp;quot; This is represented by an event structure as:</Paragraph>
    <Paragraph position="6"> Note that the rewritten stories are often ungrammatical. A sentence may map onto one, multiple, or no events. Two stories match if they are composed of the same ordering of events.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Extracting the Event Calculus
</SectionTitle>
    <Paragraph position="0"> The event calculus can be extracted from raw text by layering NLP modules using an XML-based pipeline. Our main constraint was that the text of the pupil was rarely grammatical, restricting our choice of NLP components to those that did not require a correct parse or were in any other ways dependent on grammatical sentences. At each level of processing, an XML-enabled natural language processing component can add mark-up to the text, and use any mark-up that the previous components made. All layers in the pipeline are fully automatic. For our pipeline we used LT-TTT (Language Technology Text Tokenization Toolkit) (Grover et al., 2000).</Paragraph>
    <Paragraph position="1"> Once words are tokenized and sentence boundaries detected by LT-TTT, LT-POS tags the words using the Penn Treebank tag-set without parsing the sentences. While a full parse could be generated by a statistical parser, such parses would likely be incorrect for the ungrammatical sentences often generated by the pupils (Charniak, 2000). Pronouns are resolved using a cascading rule-based approach directly inspired by the CogNIAC algorithm (Baldwin, 1997) with two variations. First, it resolves in distinct cascades for singular and then plural pronouns. Second, it resolves using only the CogNIAC rules that can be determined using Penn Tree-bank tags. The words are lemmatized using an augmented version of the SCOL Toolset and sentences are chunked using the Cass Chunker (Abney, 1995).</Paragraph>
    <Paragraph position="2"> There is a trade-off between this chunking approach that works on ungrammatical sentences and one that requires a full parse such as those using dependency grammars. The Cass Chunker is highly precise, but often inaccurate and misses relations and entities that are not in a chunk. In its favor, those tuples in chunks that it does identify are usually correct. SCOL extracts tuples from the chunks to determine the presence of events, and the remaining elements in the chunk are inspected via rules for entities. Time is explicitly identified using a variation of the &amp;quot;now point&amp;quot; algorithm (Allen, 1987). We map each event's time variable to a time-line, assuming that events occur in the order in which they appear in the text. While temporal ordering of events is hard (Mani and Wilson, 2003), given that children of this age tend to use a single tense throughout the narrative and that in narratives events are presented in order (Hickmann, 2003), this simple algorithm should suffice for ordering in the domain of children's stories.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Plot Comparison Algorithm
</SectionTitle>
    <Paragraph position="0"> Since the story rewriting task involves imperfect recall, story events will likely be changed or left out by the pupil. The story rewriting task involves the students choosing their own diction and expressing their own unique mastery of language, so variation in how the fundamental elements of the story are rewritten is to be expected. To deal with these issues, an algorithm had to be devised that takes the event structure of the rewritten story and compares it to the event structure of the exemplar story, while disregarding the particularities of diction and grammar. The problem is one of credit allocation for the similarity of rewritten events to the exemplar event.</Paragraph>
    <Paragraph position="1"> The words used in the events of the two story models may differ. The exemplar story model might use the event see(Nils,stork), but a rewritten story may use the word &amp;quot;bird&amp;quot; instead of the more precise word &amp;quot;stork.&amp;quot; However, since the &amp;quot;bird&amp;quot; is referring to the stork in the exemplar story, partial credit should be assigned. A plot comparison algorithm was created that uses abstract event calculus representations of plot and the text of the rewritten story, taking into account temporal order and word similarity. The exemplar story's event structure is created by applying the event extraction pipeline to the storyteller's transcript.</Paragraph>
    <Paragraph position="2"> The Plot Comparison Algorithm is given in Figure 1. In the pseudo-code, E of size h and R of size j are the event structures of the exemplar story and rewritten story respectively, with the names of each of their events denoted as e and r. The set of entities of each event are denoted as Ne and Nr respectively.</Paragraph>
    <Paragraph position="3"> T is the lemmatized tokens of the rewritten story's raw text. WordNet(x) denotes the synset of x. The &amp;quot;now point&amp;quot; of the rewritten story is t, and feature set is f, which has an index of i. The index i is incremented every time f is assigned a value. 1 denotes an exact match, 2 a WordNet synset match, 3 a match in the text, and 0 a failure to find any match.</Paragraph>
    <Paragraph position="4"> The Plot Comparison Algorithm essentially iterates through the exemplar story looking for matches of the events in the rewritten story. To find if two events are in or out of order the rewritten story has a &amp;quot;now point&amp;quot; that serves as the beginning of its iteration. Each event of the event structure of the exemplar story is matched against each event of the rewritten story starting at the &amp;quot;now point&amp;quot; and using the exact text of the event name. If that match fails a looser match is attempted by giving the event names of the rewritten story to WordNet and seeing if a match to the resultant synset succeeds (Fellbaum, 1998). If either match attempt succeeds, the algorithm attempts to match entities in the same fashion and the &amp;quot;now point&amp;quot; of the rewritten story is incremented. Thus the algorithm does not looks back in the rewritten story for a match. If the event match fails, one last attempt is made by checking the event name or entity against every lemmatized token in the entire rewritten text. If this fails, a failure is recorded. The results of the algorithm are can be used as a feature set for machine-learning. The event calculus extraction pipeline and the Plot Comparison Algorithm can produce event calculus representations of any English text and compare them.</Paragraph>
    <Paragraph position="5"> They have been tested on other stories that do not have a significant corpus of rewritten stories. The number of events for an average rewritten story in our corpus was 26, with each event having an average of 1 entity.</Paragraph>
    <Paragraph position="6"> Included in Figure 2 is sample output from our algorithm given the exemplar story model ea and a rewritten story rb whose text is as follows: Nils took the coin and tossed it away, cause it was worthless.</Paragraph>
    <Paragraph position="7"> A city appeared and so he walked in. Everywhere was gold and the merchant said Buy this Only one coin Nils has no coin. So he went to get the coin he threw away but the city vanished just like that right behind him. Nils asked the bird Hey where the city go? Let's go home.</Paragraph>
    <Paragraph position="8"> Due to space limitations, we only display selected events from the transcript and their most likely match from the rewritten story in Figure 2. The output of the feature set would be the concatenation in order of every value of fe.</Paragraph>
    <Paragraph position="10"> for ex e1 to eh do for ry rt to rj</Paragraph>
    <Paragraph position="12"/>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Learning the Significance of Events
</SectionTitle>
    <Paragraph position="0"> Machine-learning is crucial to our experiment, as it will allow our model to discriminate what events and words in a rewritten story are good predictors of plot quality as rated by a human expert. We have restricted our feature set to the results of the Plot Comparison Algorithm and LSA scores, as we describe below. Other possible features, such as the grammatical correctness and the number of conjunctives, are dealt with by other agents in StoryStation. We are focusing on plot recall quality as opposed to general writing quality. Two different machine-learning algorithms with differing assumptions were used. These are by no means exhaustive of the options, and extensive tests have been done with other algorithms. Further experiments are needed to understand the precise nature of the relations between the feature set and machine learning algorithms. All results were created by ten-fold cross validation over the rated stories, which is especially important given our small corpus size.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
7.1 Nearest Neighbors using LSA
</SectionTitle>
      <Paragraph position="0"> We can classify the stories without using the results of the Plot Comparison Algorithm, and instead use only their statistical attributes. Latent Semantic Analysis (LSA) provides an approximation of &amp;quot;semantic&amp;quot; similarity based on the hypothesis that the semantics of a word can be deduced from its context in an entire document, leading to useful coherency scores when whole documents are compared (Foltz et al., 1998). LSA compares the text of each rewritten story in the corpus for similarity to the transcript of the exemplar story in a subspace produced by reducing the dimensionality of the TASA 12 grade USA reading-level to 200. This dimensionality was discovered through experimentation to be our problem's optimal parameters for LSA given the range of choices originally used by Landauer (1997). The stories can be easily classified by grouping them together based on LSA similarity scores alone, and this technique is embodied in the simple K-Nearest Neighbors (K-NN) learner. K-NN makes no parametric assumptions about the data and uses no formal symbolic features other than an LSA similarity score. For K-NN k = 4 gave the best results over a large range of k, and we expect this k would be ideal for stories of similar length.</Paragraph>
      <Paragraph position="1"> As shown in Table 2, despite its simplicity this algorithm performs fairly well. It is not surprising that features based primarily on word distributions such as LSA could correctly discriminate the non-poor from the poor rewritten stories. Some good rewritten stories closely resemble the exemplar story almost word for word, and so share the same word distribution with the exemplar story. Poor rewritten stories usually have little resemblance to the exemplar story, and so have a drastically different word distribution. The high spread of error in classifying stories is shown in the confusion matrix in Table 3.</Paragraph>
      <Paragraph position="2"> This leads to unacceptable errors such as excellent stories being classified as poor stories.</Paragraph>
      <Paragraph position="3"> 7.2 Hybrid Model with Naive Bayes By using both LSA scores and event structures as features for a statistical machine learner, a hybrid model of plot rating can be created. In hybrid mod- null els a formal symbolic model (the event calculus-based results of a Plot Comparison Algorithm) enters a mutually beneficial relationship with a statistical model of the data (LSA), mediated by a machine learner (Naive Bayes). One way to combine LSA similarity scores and the results of the event structure is by using the Naive Bayes (NB) machine learner. NB makes the assumptions of both parametrization and Conditional Independence.</Paragraph>
      <Paragraph position="4"> The recall and precision per rank is given in Table 4, and it is clear that while no stories are classified as excellent at all, the majority of good and poor stories are identified correctly. As shown by the confusion matrix in Table 5, NB does not detect excellent stories and it collapses the distinction between good and excellent stories. Compared to K-NN with LSA, NB shows less spread in its errors, although it does confuse some poor stories as good and one excellent story as fair. Even though it mistakenly classifies some poor stories as good, for many teachers this is better than misidentifying a good story as a poor story.</Paragraph>
      <Paragraph position="5"> The raw accuracy results over all classes of the machine learning algorithms are summarized in Table 6. Note that average human rater agreement is the average agreement between Rater A and C (whose agreement ranged from 39% to 77%), since Rater B's ratings were used as the gold standard.</Paragraph>
      <Paragraph position="6"> This average also assumes Rater A would have continued marking at the same accuracy for the com- null plete corpus. DT refers to an ID3 Decision Tree algorithm that creates a purely symbolic machinelearner whose feature set was only the results of the Plot Comparison Algorithm (Quinlan, 1986). It performed worse than K-NN and thus the details are not reported any further. Using NB and combining the LSA scores with the results of the Plot Comparison Algorithm produces better raw performance than K-NN. Recall of 54% for NB may seem disappointing, but given that the raters only have an average agreement of 58%, the performance of the machine learner is reasonable. So if the machinelearner had a recall of 75% it would be suspect. Statistics to compare the results given the ordinal nature of our rating scheme are shown in Table 7.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>