<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0201">
  <Title>Utterance Classification in AutoTutor</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 An utterance taxonomy
</SectionTitle>
    <Paragraph position="0"> The framework for utterance classification in Table 1 is familiar to taxonomies in the cognitive sciences (Graesser et al. 1992; Graesser and Person 1994). The most notable system within this framework is QUALM (Lehnert 1978), which utilizes twelve of the question categories. The taxonomy can be divided into 3 distinct groups, questions, frozen expressions, and contributions. Each of these will be discussed in turn.</Paragraph>
    <Paragraph position="1"> The conceptual basis of the question categories arises from the observation that the same question may be asked in different ways, e.g. &amp;quot;What happened?&amp;quot; and &amp;quot;How did this happen?&amp;quot; Correspondingly, a single lexical stem for a question, like &amp;quot;What&amp;quot; can be polysemous, e.g. both in a definition category, &amp;quot;What is the definition of gravity?&amp;quot; and metacommunicative, &amp;quot;What did you say?&amp;quot; Furthermore, implicit questions can arise in tutoring via directives and some assertions, e.g. &amp;quot;Tell me about gravity&amp;quot; and &amp;quot;I don't know what gravity is.&amp;quot; In AutoTutor these information seeking utterances are classified to one of the 16 question categories.</Paragraph>
    <Paragraph position="2"> The emphases on queried concepts rather than orthographic forms make the categories listed in Table 1 bear a strong resemblance to speech acts. Indeed, Graesser et al. (1992) propose that the categories be distinguished in precisely the same way as speech acts, using semantic, conceptual, and pragmatic criteria as opposed to syntactic and lexical criteria. Speech acts presumably transcend these surface criteria: it is not what is being said as what is done by the saying (Austin, 1962; Searle, 1975).</Paragraph>
    <Paragraph position="3"> The close relation to speech acts underscores what a difficult task classifying conceptual questions can be.</Paragraph>
    <Paragraph position="4"> Jurafsky and Martin (2000) describe the problem of interpreting speech acts using pragmatic and semantic inference as AI-complete, i.e. impossible without creating a full artificial intelligence. The alternative explored in this paper is cue or surface-based classification, using no context.</Paragraph>
    <Paragraph position="5"> It is particularly pertinent to the present discussion that the sixteen qualitative categories are employed in a quantitative classification process. That is to say that for the present purposes of classification, a question must belong to one and only one category. On the one hand this idealization is necessary to obtain easily analyzed performance data and to create a well-balanced training corpus. On the other hand, it is not entirely accurate because some questions may be assigned to multiple categories, suggesting a polythetic coding scheme (Graesser et al. 1992). Inter-rater reliability is used in the current study as a benchmark to gauge this potential effect.</Paragraph>
    <Paragraph position="6"> Frozen expressions consist of metacognitive and metacommunicative utterances. Metacognitive utterances describe the cognitive state of the student, and they therefore require a different response than questions or assertions. AutoTutor responds to metacognitive utterances with canned expressions such as, &amp;quot;Why don't you give me what you know, and we'll take it from there.&amp;quot; Metacommunicative acts likewise refer to the dialogue between tutor and student, often calling for a repetition of the tutor's last utterance. Two key points are worth noting: frozen expressions have a much smaller variability than questions or contributions, and frozen expressions may be followed by some content, making them more properly treated as questions. For example, &amp;quot;I don't understand&amp;quot; is frozen, but &amp;quot;I don't understand gravity&amp;quot; is a more appropriately a question. Contributions in the taxonomy can be viewed as anything that is not frozen or a question; in fact, that is essentially how the classifier works. Contributions in AutoTutor, either as responses to questions or unprompted, are tracked to evaluate student performance via LSA, forming the basis for feedback.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Classifier Algorithm
</SectionTitle>
    <Paragraph position="0"> The present approach ignores the semantic and pragmatic context of the questions, and utilizes surface features to classify questions. This shallow approach parallels work in question answering (Srihari and Li 2000; Soubbotin and Soubbotin 2002; Moldovan et al 1999). Specifically, the classifier uses tagging provided by ApplePie (Sekine and Grishman 1995) followed by cascaded finite state transducers defining the categories.</Paragraph>
    <Paragraph position="1"> The finite state transducers are roughly described in  and a disambiguation routine is applied at the end to select a single category.</Paragraph>
    <Paragraph position="2">  Does the pumpkin land in his hands? Is the pumpkin accelerating or decelerating? Where will the pumpkin land? What are the components of the forces acting on the pumpkin? How far will the pumpkin travel? What is acceleration? What is an example of Newton's Third Law? What is the difference between speed and velocity? What is happening in this situation with the runner and pumpkin? What caused the pumpkin to fall? What happens when the runner speeds up? Why did you ignore air resistance? How do you calculate force?  Immediately after tagging, transducers are applied to check for frozen expressions. A frozen expression must match, and the utterance must be free of any nouns, i.e. not frozen+content, for the utterance to be classified as frozen. Next the utterance is checked for question stems, e.g. WHAT, HOW, WHY, etc. and question mark punctuation. If question stems are buried in the utterance, e.g. &amp;quot;I don't know what gravity is&amp;quot;, a movement rule transforms the utterance, placing the stem at the beginning. Likewise if a question ends with a question mark but has no stem, an AUX stem is placed at the beginning of the utterance. In this way the same transducers can be applied to both direct and indirect questions. At this stage, if the utterance does not possess a question stem and is not followed by a question mark, the utterance is classified as a contribution.</Paragraph>
    <Paragraph position="3"> Two sets of finite state transducers are applied to potential questions, keyword transducers and syntactic pattern transducers. Keyword transducers replace a set of keywords specific to a category with a symbol for that category. This extra step simplifies the syntactic pattern transducers that look for the category symbol in their pattern. The definition keyword transducer, for example, replaces &amp;quot;definition&amp;quot;, &amp;quot;define&amp;quot;, &amp;quot;meaning&amp;quot;, &amp;quot;means&amp;quot;, and &amp;quot;understanding&amp;quot; with &amp;quot;KEYDEF&amp;quot;. For most categories, the keyword list is quite extensive and exceeds the space limitations of Table 2. Keyword transducers also add the category symbol to a list when they match; this list is used for disambiguation. Syntactic pattern transducers likewise match, putting a category symbol on a separate disambiguation list.</Paragraph>
    <Paragraph position="4"> In the disambiguation routine, both lists are consulted, and the first category symbol found on both lists determines the classification of the utterance. Clearly  ordering of transducers affects which symbols are closest to the beginning of the list. Ordering is particularly relevant when considering categories like concept completion, which match more freely than other categories. Ordering gives rarer and stricter categories a chance to match first; this strategy is common in stemming (Paice 1990).</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Training
</SectionTitle>
    <Paragraph position="0"> The classifier was built by hand in a cyclical process of inspecting questions, inducing rules, and testing the results. The training data was derived from brainstorming sessions whose goal was to generate questions as lexically and syntactically distinct as possible. Of the brainstormed questions, only when all five raters agreed on the category was a question used for training; this approach filtered out polythetic questions and left only archetypes.</Paragraph>
    <Paragraph position="1"> Intuitive analysis suggested that the majority of questions have at most a two-part pattern consisting of a syntactic template and/or a keyword identifiable for that category. A trivial example is disjunction, whose syntactic template is auxiliary-initial and corresponding keyword is &amp;quot;or&amp;quot;. Other categories were similarly defined either by one or more patterns of initial constituents, or a keyword, or both. To promote generalizability, extra care was given not to overfit the training data. Specifically, keywords or syntactic patterns were only used to define categories when they occurred more than once or were judged highly diagnostic. null  The results of the training process are shown in Table 4. Results from each category were compiled in 2 x 2 contingency tables like Table 3, where tp stands for &amp;quot;true positive&amp;quot; and fn for &amp;quot;false negative&amp;quot;. Recall, fallout, precision, and f-measure were calculated in the following way for each category:</Paragraph>
    <Paragraph position="3"> Recall and fallout are often used in signal detection analysis to calculate a measure called d' (Green and Swets 1966). Under this analysis, the performance of the classifier is significantly more favorable than under the F-measure, principally because the fallout, or false alarm rate, is so low. Both in training and evaluation, however, the data violate assumptions of normality that d' requires.</Paragraph>
    <Paragraph position="4"> As explained in Section 3, a contribution classification is the default when no other classification can be given. As such, no training data was created for contributions. Likewise frozen expressions were judged to be essentially a closed class of phrases and do not require training. Absence of training results for these categories is represented by double stars in Table 4.</Paragraph>
    <Paragraph position="5"> During the training process, the classifier was never tested on unseen data. A number of factors it difficult to obtain questions suitable for testing purposes. Brainstormed questions are an unreliable source of testing data because they are not randomly sampled. In general, corpora proved to be an unsatisfactory source of questions due to low inter-rater reliability and skewed distribution of categories.</Paragraph>
    <Paragraph position="6"> Low inter-rater reliability often could be traced to anaphora and pragmatic context. For example, the question &amp;quot;Do you know what the concept of group cell is?&amp;quot; might license a definition or verification, depending on the common ground. &amp;quot;Do you know what it is?&amp;quot; could equally license a number of categories, depending on the referent of &amp;quot;it&amp;quot;. Such questions are clearly beyond the scope of a classifier that does not use context. The skewed distribution of the question categories and their infrequency necessitates use of an extraction algorithm to locate them. Simply looking for question marks is not enough: our estimates predict that raters would need to classify more than 5,000 questions extracted from the Wall Street Journal this way to get a mere 20 instances of the rarest types. A bootstrapping approach using machine learning is a possible alternative that will be explored in the future (Abney 2002). Regardless of these difficulties, the strongest evaluation results from using the classifier in a real world task, with real world data.</Paragraph>
  </Section>
class="xml-element"></Paper>