<?xml version="1.0" standalone="yes"?>
<Paper uid="W96-0304">
  <Title>Using Lexical Semantic Techniques to Classify Free-Responses</Title>
  <Section position="3" start_page="22" end_page="23" type="metho">
    <SectionTitle>
3. The Formulating-Hypotheses Item 2
</SectionTitle>
    <Paragraph position="0"> Responses from the Formulating-Hypotheses item (F-H) were used in this study. F-H is an experimental inferencing item in which an examinee is presented with a short passage (about 30 words) in which a hypothetical situation is described, and s/he composes up to 15 hypotheses that could explain Why the situation exists. Examinee responses do not have to be in complete sentences, and can be up tO 15 words in length. For example, an item referred to as the police item describes a situation in which the number of police being killed has reduced over a 20-year period. The examinee is theasked to give reasons as to why this might have occurred. Sample responses are illustrated in (1).</Paragraph>
    <Paragraph position="1">  (1) Sample correct responses to the police item a. Better cadet training programs b. Police wear bullet-proof vests c. Better economic circumstances mean less crime.</Paragraph>
    <Paragraph position="2"> d. Advanced medical technology has made it possible to save more lives.</Paragraph>
    <Paragraph position="3"> e. Crooks now have a decreased ability to purchase guns.</Paragraph>
    <Section position="1" start_page="22" end_page="23" type="sub_section">
      <SectionTitle>
3.1 Required Scoring Tasks for F-H
</SectionTitle>
      <Paragraph position="0"> Our task is to create a system which will score the data using the same criteria used in hand-scoring.</Paragraph>
      <Paragraph position="1"> In the hand-scoring process, test developers (i.e., the individuals who create and score exams) create a multiple-category rubric, that is, a scoring key, in which each category is associated with a set of correct or incorrect responses. A multiple-category rubric must be created to capture any possible response duplication that could occur in the examinees multiple response file. For instance, if an examinee had two responses, Better trained police, and Cops are more highly trained, the scoring system must identify these two responses as duplicates which should not both count toward the final score. Another reason for multiple-category assignment is to be able to provide content-relevant explanations as to why a response was scored a certain way. Our current prototype was designed to classify responses according to a set of training responses which had been hand-scored by test developers in a multiple-category rubric they had developed. For the police data set, there were 47 categories associated with a set of 200 training responses. Each rubric category had between 1 and 10 responses.</Paragraph>
      <Paragraph position="2"> 3.2. Characterization of police training data The training set responses have insufficient lexico-syntactic overlap to rely on lexical co-occurrence and frequencies to yield content information. For instance, police and better occur frequently, but in varying structures, such as in the responses, Police officers were better trained, and Police receiving better training to avoid getting killed in the line of duty. These two responses must be classified in  separate categories: (a) Better police training, general, and (b) Types of self-defense~safety techniques, respectively.</Paragraph>
      <Paragraph position="3"> Metonyms within content categories had to be manually classified, since such relations were often not derivable from real-world knowledge bases. For instance, in the training responses, A recent push in safety training has paid off for modern day police, and &amp;quot;Officers now better combat trained..., &amp;quot; the terms safety training with combat trained, needed to be related. Test developers had categorized both responses under the Trained for self-defense~safety category. Safety training and combat train were terms related to a type of training with regard to personal safety. The terms had to be identified as metonyms in order to classify the responses accurately.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="23" end_page="27" type="metho">
    <SectionTitle>
4. Strategy for Representing Police Responses
</SectionTitle>
    <Paragraph position="0"> As previously mentioned, there was insufficient lexico-syntactic patterning to use a contextual word use method, and domain-specific word use could not be derived from real-world knowledge sources.</Paragraph>
    <Paragraph position="1"> Therefore, we developed a domain-specific concept lexicon based on a set of 200 training responses over all categories. Each single, relevant word or 2-3 word term was linked to a concept entry.</Paragraph>
    <Paragraph position="2"> Small concept grammars were developed for individual rubric categories. These grammars were based on the conceptual-structural representations identified in the training response set.</Paragraph>
    <Paragraph position="3"> As much as possible, it was important that the rules represented the relationship between multiple concepts within a phrasal constituent. The phrasal constituent itself, that is, whether it was an NP or a VP did not seem relevant. It was only meaningful that a constituent relationship occurred. Without this structural information, the concepts could occur in any position in a response, and automatic category assignment would not be reliable (Burstein and Kaplan (1995)). The procedure used to identify conceptual and syntactic information, retrieves concepts within specific phrasal and clausal categories. Once a response was processed, and concept tags were assigned, all phrasal and clausal categories were collapsed into a general phrasal category, XP, for the scoring process, as illustrated in (4), below. There were some cases, however, where we had no choice but to include some single concepts, due to the limited lexico-syntactic patterning in the data.</Paragraph>
    <Section position="1" start_page="23" end_page="24" type="sub_section">
      <SectionTitle>
4.1. The Scoring Lexicon for the Police Item
</SectionTitle>
      <Paragraph position="0"> What we term the scoring lexicon can best be illustrated by Bergler's (1995) layered lexicon. The underlying idea in Bergler's approach is that the lexicon has several layers which are modular, and new layers can be plugged in for different texts. In this way, lexical entries can be linked appropriately to text-specific information. In the layered lexicon approach, words are linked to definitions within some hierarchy. Bergler's approach also has a meta-lexical layer which maps from syntactic patterns to semantic interpretation that does not affect the lexicon itself. By comparison, our scoring lexicon, contains a list of base word forms (i.e., concepts). 3 The definitions associated with these concepts were typically metonyms that were specific to the domain of the item. These metonym definitions were subordinate to the words they defined. In the spirit of the layered lexicon, the definitions associated with the superordinate concepts are modular, and can be changed given  For this study, metonyms for each concept were chosen from the entire set of single words over the whole training set, and specialized 2-word and 3-word terms (i.e., domain-specific and domain-independent idioms) which were found in the training data. The lexicon developed for this study was based on the training data from all rubric categories. In (2), below, a sample from the lexicon is given. Our concept grammars, described in Section 4.2, are in the spirit of Bergler's notion of a meta-lexical layer that provides a mapping between the syntax and semantics of individual responses.</Paragraph>
      <Paragraph position="1"> In our lexicon, concepts are preceded by #. Metonyms follow the concepts in a list. Lexical entries not preceded by # are relevant words from the set of training responses, which are metonyms of concepts. These entries will contain a pointer to a concept, indicated by '% &lt;concept&gt;'. A sample of the lexicon is illustrated below.</Paragraph>
      <Paragraph position="2"> (2) Sample from the Police Item Lexicon #BE'ITER \[ better good advance improve increase ...</Paragraph>
      <Paragraph position="3"> efficient modem well increase \]</Paragraph>
      <Paragraph position="5"/>
    </Section>
    <Section position="2" start_page="24" end_page="24" type="sub_section">
      <SectionTitle>
4.2 Concept Grammar Rules for the Police Item
</SectionTitle>
      <Paragraph position="0"> The concept grammar rule templates for mapping and classifying responses were built from the 172 training set responses in 32 categories. 4 The training data was parsed using the parser in Microsoft's Natural Language Processing Tool (see MS-NLP(1996) for a description of this tool). For this study, suffixes were removed by hand from the parsed data. Based on the syntactic parses of these responses and the lexicon, a small concept grammar was manually built for each category which characterized responses by concepts and relevant structural information. The phrasal constituents were unspecified. Sample concept grammar rules are illustrated in (3).</Paragraph>
      <Paragraph position="1">  (3) Sample Concept Grammar Rules for Types of self-defense/safety a. XP: \[POLICE\],XP: \[BETTER,TRAIN\],XP: \[SAFETY\]</Paragraph>
    </Section>
    <Section position="3" start_page="24" end_page="25" type="sub_section">
      <SectionTitle>
4.3 Processing Responses for Category Assignment
</SectionTitle>
      <Paragraph position="0"> Responses were, parsed, and then input into the phrasal node extraction program. The program extracted words and terms in Noun Phrases (NP), Verb Phrases (VP), Prepositional Phrases (PP),  representation, XP. All single XPs and combinations of XPs were matched against the concept grammars for each content category to locate rule matches. This procedure is illustrated below.  c. Collapse Phrasal Nodes: XP: \[Cops=POLICE\] XP: \[better=BETTER,trained=TRAIN\] XP: \[self-defense=SAFETY\] d. Match Tagged Nodes to Concept Grammar Rules: XP: \[POLICE\], XP:\[BETTER,TRAIN\],XP:\[SAFETY\]</Paragraph>
    </Section>
    <Section position="4" start_page="25" end_page="26" type="sub_section">
      <SectionTitle>
4.4 Does Manual Preprocessing of the Data Outweigh the Benefits of Automated Scoring?
</SectionTitle>
      <Paragraph position="0"> Since the preprocessing of this response data is done by hand, the total person-time must be considered in relation to how long it would take test developers to hand score a data set in a real-world application.</Paragraph>
      <Paragraph position="1"> We must address the issue of whether or not a computer-based method would be efficient with regard to time and cost of scoring.</Paragraph>
      <Paragraph position="2"> In this study, the manual creation of the lexicon and the concept grammar rules for this data set took two people approximately one week, or 40 hours. Currently, we are developing a program to automate the generation of the concept grammars. We expect that once this program is in place, our preprocessing time will be cut in half. So, we estimate that it would take one person approximately 8 -10 hours to create the lexicon, and another 8 - 10 hours to do the preprocessing and post-processing required in conjunction with the automatic rule generation process currently being developed.</Paragraph>
      <Paragraph position="3"> The F-H item is currently only a pilot item for the Graduate Record Examination (GRE), which administers approximately 28,000 examinees, yearly. For the F-H item, each examinee can give up to 15 responses. So, the maximum number of responses for this item over the year would be approximately 420,000. Each examinee's response set would then typically be scored by two human graders. It is difficult to estimate how long the manual scoring process would take in hours, but, presumably, it would take longer than the approximately 40 hours it took to build the lexicon and concept grammars.</Paragraph>
      <Paragraph position="4"> Certainly, it would take longer than the 20 hours estimated, once the automatic rule generator is implemented. Therefore, assuming that the accuracy of this method could be improved satisfactorily, automated scoring would appear to be a viable cost-saving and time-saving option.</Paragraph>
      <Paragraph position="5">  !</Paragraph>
    </Section>
    <Section position="5" start_page="26" end_page="26" type="sub_section">
      <SectionTitle>
5.1 Initial Results
</SectionTitle>
      <Paragraph position="0"> One hundred and seventy-two responses were used for training. These responses were used to build the lexicon and the concept grammar rules. An additional, independent set of 206 test responses from 32 content categories was run through our prototype. The following were the results.</Paragraph>
    </Section>
    <Section position="6" start_page="26" end_page="26" type="sub_section">
      <SectionTitle>
5.2 Error Accountability
</SectionTitle>
      <Paragraph position="0"> Most of the errors made in classifying the data can be accounted for by four error types: (a) lexical gap, (b) human grader misclassification, (c) concept-structure problem, (d) cross-classification. The lexical gap error characterizes cases in which a response could not be classified because it was missing a concept tag, and, therefore, did not match a rule in the grammar. In reviewing the lexical gap errors, we found that the words not recognized by the system were metonyms that did not exist in the training, and were not identified as synonyms in any of our available thesaurus or on-line dictionary sources. For instance, in the response, &amp;quot;Police are better skilled...,&amp;quot; the phrase better skilled, should be equated to better trained, but this could not be done based on the training responses, or dictionary sources. Forty percent of the errors were lexical gap errors. The second problem was human grader misclassification which accounted for ! percent of the errors. In these cases, it was clear that responses had been inadvertently misclassified, so the system either misclassified the response, also. For example, the response, Officers are better trained and more experienced so they can avoid dangerous situations, was misclassified in Better trained police, general. It is almost identical to most of the responses in the category Better intervention~crook counseling. Our. system, therefore, classified the response in Better intervention~crook counseling.</Paragraph>
      <Paragraph position="1"> Concept-structure problems made up 30 percent of the errors. These were cases in which a response could not be classified because its concept-structural patterning was different from all the concept grammar rules for all content categories. The fourth error type accounted for 17 percent of the cases in which there was significant conceptual similarity between two categories, such that categofial cross-classification occurred.</Paragraph>
    </Section>
    <Section position="7" start_page="26" end_page="27" type="sub_section">
      <SectionTitle>
5.3 Additional Results Using an Augmented Lexicon
</SectionTitle>
      <Paragraph position="0"> As discussed above, 40 percent of the errors could be accounted for by lexical gaps. We hypothesized that our results would improve if more metonyms of existing concepts were added to the lexicon. Therefore, we augmented the lexicon with metonyms that could be accessed from the  test data. We reran the scoring program, using the augmented lexicon on the same set of data. The results of this run were the following.</Paragraph>
      <Paragraph position="1">  The improvement which occurred by augmenting the lexicon further supports our procedure for classifying responses. Based on these results, we plan to explore ways to augment the lexicon without consulting the test set. Furthermore, we will use the augmented lexicon from this second experiment to score a set of 1200 new test data. 5</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="27" end_page="28" type="metho">
    <SectionTitle>
6. Conclusion
</SectionTitle>
    <Paragraph position="0"> Our results are encouraging and support the hypothesis that a lexical semantic approach can be usefully integrated into a system for scoring the free-response item described in this paper.</Paragraph>
    <Paragraph position="1"> Essentially, the results show that given a small set of data which is partitioned into several meaning classifications, core meaning can be identified by concept-structure patterns. It is crucial that a domain-specific lexicon is created to represent the concepts in the response set. Therefore, the concepts in the lexicon must denote metonyms which can be derived from the training set. Relevant synonyms of the metonyms can be added to expand the lexicon using dictionary and thesaurus sources. Using a layered lexicon approach (Bergler (1995)) allows the words in the lexicon to be maintained, while the part of the entry denoting domain-specific meaning is modular and can be replaced. The results of this case study illustrate that it is necessary to analyze content of responses based on the mapping between domain-specific concepts and the syntactic structure of a response.</Paragraph>
    <Paragraph position="2"> As mentioned earlier in the paper, previous systems did not score responses accurately due to an inability to reliably capture response paraphrases. These systems did not use structure or domain-specific lexicons in trying to analyze response content. The results show that the largest number of erroneous classifications occurred due to lexical gaps. Our second set of results shows that developing new methods to augment the lexicon would improve performance significantly. In future experiments, we plan to score an independent set of response data from the same item, using the augmented lexicon, to test the generalizability of our prototype. We realize that the results presented in this case study represent a relatively small data set. These results are encouraging, however, with regard to using a lexical semantics approach for automatic content identification on small data sets.</Paragraph>
    <Paragraph position="3"> 5We did not use these 1200 test data in the initial study, since the set of 1200 has not been scored by test developers, so we could not measure agreement with regard to human scoring decisions. However, we believe that by using the augmented lexicon, and our concept grammars to automatically score the 1200 independent data, we can get a</Paragraph>
  </Section>
class="xml-element"></Paper>