<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1051">
  <Title>Error Profiling: Toward a Model of English Acquisition for Deaf Learners</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Profiling Language Errors
</SectionTitle>
    <Paragraph position="0"> We have established the need for a description of the general progress of English acquisition as determined by the mastery of grammatical forms.</Paragraph>
    <Paragraph position="1"> We have undertaken a series of studies to establish an order-of-acquisition model for our learner population, native users of American Sign Language. In our first efforts, we have been guided by the observation that the errors committed by learners at different stages of acquisition are clues to the natural order that acquisition follows (Corder, 1967). The theory is that one expects to find errors on elements currently being acquired; thus errors made by early learners and not by more advanced learners represent structures which the early learners are working on but which the advanced learners have acquired. Having obtained a corpus of writing samples from 106 deaf individuals, we sought to establish &quot;error profiles&quot;, namely descriptions of the different errors committed by learners at different levels of language competence. These profiles could then serve as one piece of evidence for imposing an ordering on the grammatical elements captured in the SLALOM model.</Paragraph>
    <Paragraph position="2"> This is an overview of the process by which we developed our error profiles. Goal: to have error profiles that indicate which level of acquisition is most strongly associated with which grammatical errors. It is important that the errors correspond to our grammar mal-rules so that the system can prefer parses which contain the errors most consistent with the student's level of acquisition.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Method:
</SectionTitle>
      <Paragraph position="0"> 1. Collect writing samples from our user population.
        2. Tag samples in a consistent manner with a set of error codes (where these codes have an established correspondence with the system grammar).
        3. Divide samples into the levels of acquisition they represent.
        4. Statistically analyze errors within each level and compare to the magnitude of occurrence at other levels.
        5. Analyze resulting findings to determine a progression of competence.
        In (Michaud et al., 2001) we discuss the initial steps we took in this process: the development of a list of error codes documented by a coding manual, the verification of our manual and coding scheme by testing inter-coder reliability on a subset of the corpus (where we achieved a Kappa agreement score (Carletta, 1996) of a0 a1a3a2a5a4a7a6) (footnote 2), and the subsequent tagging of the entire corpus. Once the corpus was annotated with the errors each sentence contained, we obtained expert evaluations of overall proficiency levels performed by ESL instructors using the national Test of Written English (TWE) ratings (footnote 3). The initial analysis we go on to describe in (Michaud et al., 2001) confirmed that clustering algorithms looking at the relative magnitude of different errors grouped the samples in a manner which corresponded to where they appeared in the spectrum of proficiency represented by the corpus. The next step, the results of which we discuss here, was to examine, for each error we tagged, how well the writer's level of proficiency predicts whether he or she would commit it. If we found significant differences in the errors committed by writers with different TWE scores, then we could use the errors to help organize the SLALOM elements, and through that obtain data on which errors to expect given a user's level of proficiency.</Paragraph>
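      <Paragraph> The inter-coder reliability check mentioned above can be illustrated with a small agreement computation. This is a minimal sketch of Cohen's kappa for two coders; the text does not say which kappa variant was computed, so the choice of Cohen's formulation, the function name, and the sample tags are all assumptions made for illustration.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each coder's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical error-code tags from two coders on six sentences.
coder_1 = ["missing_det", "none", "extra_prep", "none", "missing_det", "none"]
coder_2 = ["missing_det", "none", "none", "none", "missing_det", "extra_prep"]
kappa = cohen_kappa(coder_1, coder_2)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when a few codes (or the absence of any error) dominate the tags.</Paragraph>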
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Toward an error profile
</SectionTitle>
      <Paragraph position="0"> Although our samples were scored on the six-point TWE scale, we had sparse data at either end of the scale (only 5% of the samples occurring in levels 1, 5, and 6), so we concentrated our efforts on the three middle levels (2, 3, and 4), which we renamed low, middle, and high.</Paragraph>
      <Paragraph position="1"> Our chosen method of data exploration was Multivariate Analysis of Variance (MANOVA).</Paragraph>
      <Paragraph position="2"> An initial concern was to put the samples on equal footing despite the fact that they covered a broad range in length, from 2 to 58 sentences; there was a danger that longer samples would tend to have higher error counts in every category simply because the authors had more opportunity to make errors. [Footnote text, truncated in the source: &quot;... with respect to the amount of English training and the age of the writer, we expected to see a range of demonstrated proficiency for reasons discussed above. We discuss later why the ratings were not as well spread-out as we expected.&quot; and &quot;... likely to commit.&quot;]</Paragraph>
      <Paragraph position="3"> We therefore used two predictor variables in our analysis: the TWE score and the length of the sample, testing the ability of the two combined to predict the number of times a given error occurred. We ran the MANOVA using both sentence count and word count as possible length variables, and in both runs we obtained many statistically significant differences in the frequency with which writers at different TWE levels committed certain errors. These differences are illustrated in Figure 3, which shows the results on a subset of the 47 error code tags for which we got discernible results (footnote 4).</Paragraph>
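      <Paragraph> The length concern can be made concrete with a simpler device than the one used in the analysis. The sketch below divides each sample's error count by its length and averages the resulting rates per level; this is a plain rate normalization, not the MANOVA with length as a variable described above, and the field names and sample values are hypothetical.

```python
from collections import defaultdict

def mean_error_rate_by_level(samples, error_code, length_key="sentences"):
    """Average per-unit-length rate of one error code within each level."""
    rates = defaultdict(list)
    for sample in samples:
        count = sample["errors"].get(error_code, 0)
        rates[sample["level"]].append(count / sample[length_key])
    return {level: sum(values) / len(values) for level, values in rates.items()}

# Hypothetical samples (levels named low/middle/high as in the text).
samples = [
    {"level": "low", "sentences": 4, "words": 40, "errors": {"extra_prep": 2}},
    {"level": "low", "sentences": 10, "words": 90, "errors": {"extra_prep": 5}},
    {"level": "high", "sentences": 20, "words": 300, "errors": {"extra_prep": 1}},
]
per_sentence = mean_error_rate_by_level(samples, "extra_prep")
per_word = mean_error_rate_by_level(samples, "extra_prep", length_key="words")
```

Switching `length_key` between sentence and word counts mirrors the two length measures tried in the analysis.</Paragraph>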
      <Paragraph position="4"> In the figure, a bar indicates that writers at this level of proficiency committed this type of error more frequently than the others. If two of the three levels are both marked, both committed the error more frequently than the third, but the difference between those two levels was unremarkable. Solid shading indicates results which were statistically significant (with an omnibus test yielding a significance level of a31 a2a33a32a35a34 ), and intensity differences (e.g., black for extra preposition in the low level, but grey in the middle level) indicate a difference that was not significant. In that example, the low-level writers committed more extra preposition errors than the high-level writers with a significance level of .0082, and the mid-level writers also committed more of these errors than the high-level writers with a significance level of .0083. The comparison of the low and middle levels to each other, on the other hand, showed that the low-level learners committed more of this error, but the difference was far from significant at .5831.</Paragraph>
      <Paragraph position="5"> The cross-hatched and diagonal-striped results in the figure did not satisfy the cutoff of a31 a2a33a32a35a34 for significance, but were considered both interesting and close enough to significance to be worth noting. The diagonal stripes have &quot;less intensity&quot; and thus bear the same relationship to the cross-hatched bars as the grey does to the black: they mark a level at which the error occurred less often, though not significantly so. For example, high-level learners committed extra relative pronoun errors less often than mid-level learners, and both high- and mid-level learners committed them more often than the low-level learners, but, again, not to a significant extent.</Paragraph>
      <Paragraph position="6"> Notice that the overall shape of the figure supports the notion of an order of acquisition of features because one can see a &quot;progression&quot; of errors from level to level. Very strongly supportive of this intuition are the first and last errors in the figure: &quot;no parse,&quot; indicating that the coder was unable to understand the intent of the sentence, occurred significantly more often at the lowest level than at the other two levels, while &quot;no errors found&quot; was significantly most prevalent at the highest level (both with a significance level of a31 a2a33a32 a32 a32 a0 ). [Footnote 4: &quot;Activity&quot; refers to the ability to correctly form a gerund-fronted phrase describing an activity, such as &quot;I really like walking the dog;&quot; &quot;comparison phrase&quot; refers to the formation of phrases such as &quot;He is smarter than she;&quot; &quot;voice&quot; refers to the confusion between using active and passive voice, such as &quot;The soloist was sung.&quot;]</Paragraph>
      <Paragraph position="7"> Other data which is more relevant to our goals also presents itself. The lowest level exhibited higher numbers of errors on such elementary language skills as putting plural markers on nouns, placing adjectives before the noun they modify, and using conjunctions to concatenate clauses correctly. Both the low and middle levels struggled with many issues regarding forming tenses, and also exhibited &quot;ASLisms&quot; in their English, such as the dropping of constituents which are not explicitly realized in ASL (such as determiners, prepositions, verb subjects and objects which are established discourse entities in focus, and the verb &quot;TO BE&quot;), or the treatment of certain discourse entities as they would be in ASL (e.g., using &quot;here&quot; as if it were a pronoun). While beginning learners struggled with more fundamental problems with subordinate clauses such as missing gaps, the more advanced learners struggled with using the correct relative pronouns to connect those clauses to their matrix sentence. Where the lower two levels committed more errors with missing determiners, the highest level among our writers had learned the necessity of determiners in English but was over-generalizing the rule and using them where they were not appropriate.</Paragraph>
      <Paragraph position="8"> Finally, the upper level learners were beginning to experiment with more complex verb constructions such as the passive voice. All of this begins to draw a picture of the sequence in which these structures are mastered across these levels.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Discussion
</SectionTitle>
      <Paragraph position="0"> While Figure 3 is meant to illustrate how the three different levels committed different sets of errors, it is clear that this picture is incomplete. The low and middle levels are insufficiently distinguished from each other, and there were very few errors committed most often by the highest level. Most importantly, many of the distinctions between levels were not achieved to a significant degree.</Paragraph>
      <Paragraph position="1"> One of the reasons for these problems is the fact that our samples are concentrated in only three levels in the center of the TWE spectrum.</Paragraph>
      <Paragraph position="2"> We hope to address this in the future by acquiring additional samples. Another problem which additional samples will help to solve is sparseness of data. Across our 106 samples and 68 error codes, only 30 codes occur more than 25 times in the corpus, and only 21 codes occur more than 50 times.</Paragraph>
      <Paragraph position="3"> Most of our non-significant differences come from error codes with very low frequency, sometimes occurring as infrequently as 7 times.</Paragraph>
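      <Paragraph> The sparseness figures quoted above come down to counting how often each code occurs across the annotated corpus. A minimal sketch of that count follows, assuming the annotation is available as one list of error codes per sentence; the function name, the data layout, and the tiny corpus are hypothetical.

```python
from collections import Counter

def codes_above(tagged_sentences, threshold):
    """Return the error codes that occur more than `threshold` times."""
    totals = Counter(code for codes in tagged_sentences for code in codes)
    return sorted(code for code, n in totals.items() if n > threshold)

# Hypothetical annotated corpus: one list of error codes per sentence.
corpus = [["missing_det", "extra_prep"], ["missing_det"], [], ["missing_det", "voice"]]
frequent = codes_above(corpus, 2)
```

Running such a count with thresholds of 25 and 50 would give the kind of tallies reported in the text.</Paragraph>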
      <Paragraph position="4"> What we have established is promising, however, in that it does show statistically significant data spanning nearly every syntactic category.</Paragraph>
      <Paragraph position="5"> Additional samples must be collected and analyzed to obtain more statistical significance; however, the methodology and approach are proven solid by these results.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Future Work: Performance Profiles
</SectionTitle>
    <Paragraph position="0"> If we were to stop here, our user model design would simply group the SLALOM contents addressed by these errors in an order according to how they fell into the distribution shown in Figure 3, assuming essentially that those errors falling primarily in the low-level group represent structures that are learned first, followed by those in the low/middle overlap area, followed by those with which mostly the mid-level writers were struggling, followed finally by those which mostly posed problems for our highest-level writers.</Paragraph>
    <Paragraph position="1"> Given this structure, and a general classification of a given user, if we are attempting to select between competing parses for a sentence written by this user, we can prefer the parse whose errors most closely fit those for the profile to which the user belongs. However, up until now we have only gathered information on the errors committed by our learner population, and thus we still have no information on a great many grammatical constructions. Consider that some types of grammatical constructions may be avoided or used correctly at low levels but that the system would have no knowledge of this. By only modeling the errors, we fail to capture the acquisition order data provided by knowing what structures a writer can successfully execute at the different levels. Therefore, the sparse data problems we faced in this work are only partly explained by the small corpus and some infrequent error codes.</Paragraph>
    <Paragraph position="2"> They are also explained by the fact that errors are only one half of the total picture of user performance. Although we experimented in this work with equalizing the error counts using different length measures, we did not have access to the numbers that would have provided the most meaningful normalization: namely, the number of times a structure is attempted. It is our belief that information on the successful structures in the users' writing would give us a much clearer view of the students' performance at each level. Tagging all sentences for the correct structures, however, is an intractable task for a human coder. On the other hand, while it is feasible to have this information collected computationally through our parser, we are still faced with the problem of competing parses for many sentences. Our methodology to address this problem is to use the human-generated error codes to select among the parse trees in order to gather statistics on fully-parsed sentences.</Paragraph>
    <Paragraph position="3"> We have therefore created a modified version of our user interface which, when given a sample of writing from our corpus, records all competing parse trees for all sentences to a text file (footnote 5). Another application has been developed to compare these system-derived parse trees against the human-assigned error code tags for those same sentences to determine which tree is the closest match to human judgment. To do this, each tree is traversed and all constituents corresponding to mal-rules are recorded as the equivalent error code tag. The competing lists of errors are then compared against the sequence determined by the human coder via a string alignment/comparison algorithm which we discuss in (Michaud et al., 2001).</Paragraph>
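    <Paragraph> The grouping just described can be sketched as a sort over error codes. The version below ranks each code by the mean index of the levels at which it was committed most often, which places a low/middle overlap between the low-only and middle-only groups; the ranking rule, the code names, and the level assignments are illustrative assumptions, not values read off Figure 3.

```python
LEVEL_INDEX = {"low": 0, "middle": 1, "high": 2}

def acquisition_order(profile):
    """Sort error codes by the mean index of the levels that mark them."""
    def rank(code):
        levels = profile[code]
        return sum(LEVEL_INDEX[level] for level in levels) / len(levels)
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(profile, key=lambda code: (rank(code), code))

# Hypothetical profile: error code -> levels that committed it most often.
profile = {
    "extra_determiner": {"high"},
    "missing_determiner": {"low", "middle"},
    "plural_marker": {"low"},
    "extra_relative_pronoun": {"middle", "high"},
}
order = acquisition_order(profile)
```

Codes early in the resulting list would correspond to structures assumed to be learned first.</Paragraph>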
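    <Paragraph> Preferring the parse that best fits a user's profile can be sketched as a scoring problem. The example below assumes each profile supplies an expected rate per error code and scores a parse by the product of the rates of the errors it posits; the scoring rule, the `unseen` default, and all numbers are hypothetical choices for illustration, not the system's actual mechanism.

```python
import math

def profile_fit(parse_errors, expected_rates, unseen=0.01):
    """Product of the profile's expected rates for the errors a parse assumes."""
    # An error-free parse scores 1.0; codes absent from the profile get `unseen`.
    return math.prod(expected_rates.get(code, unseen) for code in parse_errors)

def prefer_parse(candidates, expected_rates):
    """Return the candidate parse whose assumed errors best fit the profile."""
    return max(candidates, key=lambda parse: profile_fit(parse["errors"], expected_rates))

# Hypothetical rates for a low-level profile and two competing parses.
low_profile = {"missing_determiner": 0.30, "extra_determiner": 0.05}
candidates = [
    {"id": "A", "errors": ["extra_determiner"]},
    {"id": "B", "errors": ["missing_determiner"]},
]
best = prefer_parse(candidates, low_profile)
```

With these made-up rates the parse positing a missing determiner is preferred for a low-level writer, in line with the tendencies reported in Section 2.1.</Paragraph>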
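    <Paragraph> Selecting the tree closest to the human tags can be sketched with a standard sequence comparison. The alignment algorithm actually used is the one described in (Michaud et al., 2001) and is not reproduced here; the example below substitutes plain Levenshtein distance over error-code sequences, and the function names and data are hypothetical.

```python
def edit_distance(seq_a, seq_b):
    """Levenshtein distance between two sequences of error codes."""
    previous = list(range(len(seq_b) + 1))
    for i, item_a in enumerate(seq_a, start=1):
        current = [i]
        for j, item_b in enumerate(seq_b, start=1):
            substitution = previous[j - 1] + (item_a != item_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]

def closest_tree(tree_error_lists, human_codes):
    """Index of the parse tree whose error list is nearest the human tags."""
    distances = [edit_distance(errors, human_codes) for errors in tree_error_lists]
    return distances.index(min(distances))

# Hypothetical human tags for one sentence and the errors of three trees.
human = ["missing_det", "tense"]
trees = [["extra_prep"], ["missing_det", "tense"], ["missing_det"]]
best = closest_tree(trees, human)
```

When several trees are equally close, this sketch simply keeps the first; a real system would need a principled tie-breaking rule.</Paragraph>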
    <Paragraph position="4"> With the &quot;correct&quot; parse trees indicated for each sentence, we will know which grammar constituents each writer correctly executed and which others had to be parsed using our mal-rules. The same statistical techniques described above can then be applied to form performance profiles for capturing statistically significant differences in the grammar rules used by students within each level. This will give us a much more detailed description of acquisition status on language elements throughout the spectrum represented by our sample population.</Paragraph>
    <Paragraph position="5"> [Footnote 5: Thanks are due to Greg Silber for his work on revising our interface and creating this variation.]</Paragraph>
    <Paragraph position="6"> The implication of having such information is that once it is translated into the structure of our SLALOM user model, performance on a previously-unseen structure may be predicted based on what performance profile the user most closely fits and what tag that profile typically assigns to the structure in question; as mentioned earlier in this text, features typically acquired before a structure on which the user has demonstrated mastery can be assumed to be acquired as well. Those structures which are well beyond the user's area of variable performance (his or her current area of learning) are most likely unacquired. Since we view the information in SLALOM as projecting probabilities onto the rules of the grammar, intuitively this will allow the user's mastery of certain rules to project different default probabilities on rules which have not yet been seen in the user's language usage.</Paragraph>
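    <Paragraph> The inference described here can be sketched as a rule over ordered strata of structures. The example assumes structures are numbered by acquisition stratum and that the user's current area of learning is known as a stratum index; the `window` parameter, the three status labels, and the function name are hypothetical, and a fuller model would project probabilities rather than discrete labels.

```python
def default_status(structure_stratum, current_stratum, window=1):
    """Guess the status of an unseen structure from its acquisition stratum."""
    # Strata earlier than the user's current area of learning: assume acquired.
    if current_stratum - structure_stratum > 0:
        return "acquired"
    # Strata more than `window` steps beyond it: assume not yet acquired.
    if structure_stratum - current_stratum > window:
        return "unacquired"
    # Otherwise the structure lies in or near the area of variable performance.
    return "acquiring"

# Hypothetical user whose variable performance sits at stratum 2.
statuses = [default_status(stratum, 2) for stratum in range(5)]
```

Each label could then be mapped to a default probability on the corresponding grammar rules.</Paragraph>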
    <Paragraph position="7"> With this information, ICICLE will then be able to make principled decisions in both parsing and tutoring tasks based on a hybrid of direct knowledge about the user's exhibited proficiency on grammatical structures and the indirect knowledge we have derived from typical learning patterns of the population.</Paragraph>
  </Section>
</Paper>