<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-1007">
  <Title>From Psycholinguistic Modelling of Interlanguage in Second Language Acquisition to a Computational Model</Title>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Using Corpus Linguistics in order to Model Interlanguage
</SectionTitle>
    <Paragraph position="0"> In this section we will explain, first, the experimental method used to collect the written material and, second, the top-down methodology used to model interlanguage on the basis of the selected corpora. We build the Interlanguage Level Models by means of automatic tools which take the collected material as input. It must be noted that some information in the interlanguage models, for example the text conditions (see section 2), is detected semiautomatically with the help of the teachers and learners. Before explaining modelling based on corpus analysis, we would like to comment on the criteria used to define the corpus. We collected written material from different language schools (IRALE, ILAZKI) and grouped this material according to features of the texts such as 1) the kind of exercise proposed by the teacher (e.g.</Paragraph>
    <Paragraph position="1"> abstract, article about a subject, letter ...) and 2) the student who wrote the text. Those are students with a regular attendance in classes and with different characteristics and motivations for learning Basque (e.g. different learning rates, different knowledge about other languages, mother tongue . .. ). We codified the texts of the corpora following a prefixed notation (e.g. ill0as) showing the language school (e.g. il, ILAZKI), the language level, the learner's code, and the type of exercise proposed (e.g. s, summary). The last feature is what we have called text condition in section 2. At the same time, a database for gathering the relevant information about the students' learning process was developed. We retrieved such information from interviews with the students and the teachers (Andueza et al., 1996). The corpus collected from 1990 to 1995 is made up of 350 texts. This corpus has been divided in subsets depending on the language level. At the moment we have defined three language levels of study that we call low, intermediate, and high levels.</Paragraph>
    <Paragraph position="2"> Before designing and implementing the automatic tools, three corpus studies were carried out: in 90/91 (50 texts semiautomatically analysed), in 93/94 (20 texts), and in 94/95 (100 texts). The first study was done by teachers who did not know the students, the other two by teachers who did. In the first two cases the work lasted two months. In the third case, however, texts were collected every week from September until June, and two teachers spent five hours per week studying the corpora during the 94/95 academic year.</Paragraph>
    <Paragraph position="3"> The language learners had five hours of language classes per week, and they wrote one composition every week or every fortnight.</Paragraph>
    <Paragraph position="4"> For modelling interlanguage at different language levels we use a top-down methodology, that is, we start by modelling the high levels and continue towards the lower ones (see Fig 1). (IRALE and ILAZKI are schools specialised in the teaching of Basque.) Martxalar, Diaz de Ilarraza and Maite Oronoz, Computational Model of Interlanguage. The reasons for a top-down methodology are that most of the computational tools we have for Basque (lemmatiser, spelling checker-corrector, morphological disambiguator ...) can easily be adapted for high language levels; besides, computational tools for analysing written texts are usually more robust at high language levels than at low ones; and, finally, there is usually more written material at high levels than at low ones.</Paragraph>
    <Paragraph position="5"> We have automatically analysed subsets of the corpora at the intermediate and high levels. Taking the text as the unit of study, groups of sixteen texts have been studied automatically and in depth.</Paragraph>
    <Paragraph position="6"> The steps we followed, using the tools we have adapted, in order to build the interlanguage model for each language level N were: 1. Design of the lexical database for the Nth language level.</Paragraph>
    <Paragraph position="7"> 2. Selection of the corpus (CORPUS-N) and subsets of CORPUS-N to be used in the next steps.</Paragraph>
    <Paragraph position="8"> This selection is based on the criteria, explained before, for collecting material.</Paragraph>
    <Paragraph position="9">  3. Definition of the morphology and morphosyntax based on a subset of CORPUS-N.</Paragraph>
    <Paragraph position="10"> 4. Identification of the fixed knowledge and the variable knowledge, considering the contexts defined in section 2.</Paragraph>
    <Paragraph position="11"> (a) Evaluation of the reliability of the model using other subsets of CORPUS-N.</Paragraph>
    <Paragraph position="12"> (b) Evaluation of the results by a language teacher of N level.</Paragraph>
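<Paragraph> The numbered steps above can be sketched as a single per-level loop. This is a minimal sketch under invented data shapes (each text carries the set of rules observed in it); the induction is a toy frequency count, not the authors' actual procedure, and every helper name is a stand-in.

```python
# Hypothetical sketch of the per-level modelling steps 2-4a.
# Data shapes and helpers are stand-ins, not the real tools.

def split_subsets(corpus, held_out_size=4):
    """Step 2: reserve part of CORPUS-N for the evaluation in step 4a."""
    return corpus[held_out_size:], corpus[:held_out_size]

def model_level(corpus_n):
    train, held_out = split_subsets(corpus_n)
    # Step 3: count the rules observed in the training subset
    # (a toy stand-in for defining the morphology and morphosyntax).
    counts = {}
    for text in train:
        for rule in text["rules"]:
            counts[rule] = counts.get(rule, 0) + 1
    # Step 4: rules seen in every training text count as fixed knowledge,
    # the remainder as variable knowledge.
    fixed = {r for r, c in counts.items() if c == len(train)}
    variable = set(counts) - fixed
    # Step 4a: reliability = fraction of held-out texts consistent
    # with the fixed knowledge.
    ok = sum(1 for t in held_out if fixed.issubset(t["rules"]))
    reliability = ok / len(held_out) if held_out else 0.0
    return fixed, variable, reliability
```

Step 4b, the teacher's review, is necessarily manual and is left out of the sketch.
</Paragraph>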
    <Paragraph position="13"> For example, in the study of high-level language modelling, a teacher evaluated the results at word level, that is, the types of rules detected and the contexts where they were applied. The evaluation was successful, even though in some cases the teacher's perception did not match the results inferred from the automatic study of the corpora (e.g. in the teacher's opinion the students delete the h letter more often than they add it, a phenomenon that was not detected in the results from the corpus).</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Implementation
</SectionTitle>
    <Paragraph position="0"> We have adapted tools previously developed in our computational linguistics group over the last ten years. These tools are: the Lexical Database for Basque (EDBL) (Agirre et al., 1994), the morphological analyser based on Two-Level Morphology (Agirre et al., 1992), the lemmatiser (Aldezabal et al., 1994), and some parts of the Constraint Grammar for Basque (Alegria et al., 1996) (Karlsson et al., 1995).</Paragraph>
    <Paragraph position="1"> We have two main reasons for adapting these tools: 1. Some of the deviant linguistic structures used by second language learners are different from those native Basque speakers use. The contexts of application, i.e. the structure conditions at word level of some rules, are not the same in both cases. Moreover, we need to add some new rules in order to have one rule for each linguistic structure, and we also need some of these rules in order to detect deviant linguistic phenomena, e.g. loan words from Spanish (see section 4.1).</Paragraph>
    <Paragraph position="2"> 2. In the original tools, the context of application of the rules remained ambiguous. As we explained in section 2, the context of application is important to us for modelling the grammatical competence of the students, so we disambiguate such contexts by means of our adapted tools (see section 4.2). In the figure below we can see a scheme of the way in which we have used these adapted tools (Diaz et al., 1997): LANGUAGE LEVEL BASED DISAMBIGUATOR (based on the number of rules in each interpretation).</Paragraph>
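<Paragraph> Under assumed data shapes, the scheme of adapted tools can be read as a simple chain of stages: the adapted analyser proposes interpretations, Postprocess attaches the context features of section 4.2, and the disambiguators filter them in turn. The stage names follow the scheme; the record fields are hypothetical.

```python
# Hypothetical sketch of the adapted-tool chain: Learner_Analyser,
# Postprocess, Context_Based_Disambiguator,
# Language_Level_Based_Disambiguator. Data shapes are assumptions.

def postprocess(interpretations):
    """Attach the context features named in section 4.2."""
    for interp in interpretations:
        root = interp["root"]
        interp["context"] = {
            "place": interp["place"],            # lemma or morpheme
            "word_length": len(interp["surface"]),
            "root_final": "vowel" if root[-1] in "aeiou" else "consonant",
        }
    return interpretations

def analyse_word(word, learner_analyser, disambiguators):
    """Run one learner word through the whole pipeline."""
    interpretations = postprocess(learner_analyser(word))
    for disambiguate in disambiguators:
        interpretations = disambiguate(interpretations)
    return interpretations
```
</Paragraph>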
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
CONTEXT BASED DISAMBIGUATOR = Disambiguator for each language level
</SectionTitle>
    <Paragraph position="0"> (Based on subsets of the Constraint Grammar for Basque + disambiguation rules based on the context of application).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Redesigning the automata
</SectionTitle>
      <Paragraph position="0"> As we said above, the original morphological analyser was based on Two-Level Morphology, with the morphophonological rules implemented as automata. In the original analyser we had 30 automata: 11 of them were used for analysing standard words, but their activation was never reported; the other 19, which represented deviations, were identified, but the type of deviation remained unknown. Moreover, the context of application remained ambiguous.</Paragraph>
      <Paragraph position="1"> In the adapted analyser we have 59 automata, which represent 59 different types of phenomena (codified as shown in the table below). We have modified the automata of the morphological analyser in order to detect which rules have been applied and the contexts (structure conditions) where they have been activated. We have also made some changes in the module of the analyser which interprets the automata. The number of automata has increased due to the addition of new rules for detecting new deviations of language learners, and to the division of some original automata into others that detect, in a more specific way, some morphological phenomena which are very interesting for the study of second language acquisition. An example will illustrate this fact: Rules in the original analyser</Paragraph>
      <Paragraph position="3"> (Rule for deviant phenomenon.) The application of the rule for the standard phenomenon is not detected; however, the competence error represented in the rule for the deviant phenomenon is identified, even though the context of application remains ambiguous. Rules in the adapted analyser (e.g. LEDBH: Delete the H LEtter at the Beginning of the root): all three rules of the interlanguage and their contexts of application are detected when they have been activated. We repeat the same automaton three times and mark as negative, in each automaton, the states which correspond to the activation of the rule.</Paragraph>
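<Paragraph> The idea of automata that report their own activation can be illustrated with ordinary string matching in place of real two-level automata. LEDBH is the paper's rule code; the matching logic and the analyser below are simplifications, not the actual rule set.

```python
# Illustration only: a "rule" that recognises a deviant surface form and
# records its own activation, mimicking how the adapted automata mark rule
# application instead of accepting silently. Literal string matching here
# stands in for a two-level automaton.

class DeletionRule:
    def __init__(self, code, letter, position):
        self.code = code              # e.g. "LEDBH"
        self.letter = letter          # e.g. "h"
        self.position = position      # structure condition, e.g. root start

    def match(self, surface, lexical):
        """True if surface is lexical with letter deleted at the start."""
        return lexical.startswith(self.letter) and surface == lexical[1:]

def analyse(surface, lexicon, rules):
    """Return (lexical form, activated rule codes) pairs for a surface form."""
    results = []
    for lexical in lexicon:
        if surface == lexical:
            results.append((lexical, []))    # standard form, no deviation
        else:
            fired = [r.code for r in rules if r.match(surface, lexical)]
            if fired:
                results.append((lexical, fired))
    return results
```

Usage: with the lexical form haurra 'the boy' and the LEDBH rule, the deviant surface form aurra is analysed as haurra with LEDBH recorded as activated.
</Paragraph>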
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Rules in their Linguistic Context
</SectionTitle>
      <Paragraph position="0"> Experiments and interviews with experts led us to see the need to identify the linguistic context where a morphological rule (for a standard or deviant phenomenon) is applied. We identify this context by adding to the adapted morphological analyser (Learner_Analyser) some characteristics such as the place (lemma/morpheme) where the rule is applied, the length of the word, and the type of the last letter (vowel/consonant) of the root (Postprocess).</Paragraph>
      <Paragraph position="1"> We have two main aims in mind: 1. Disambiguate unlikely interpretations of a word (Context_Based_Disambiguator).</Paragraph>
      <Paragraph position="2"> There are two ways to do this: * Discarding interpretations in which a morphological rule has been applied in a part of the word (lemma/morpheme) where it never appears in real-life examples. For example, the deviant word *analisis has two interpretations (analisi/analisiz): the rule that adds an s at the end of the lemma, or the replacement of z by s in the morpheme. The second interpretation is not possible for high language level students, so we discard it at that level.</Paragraph>
      <Paragraph position="3"> * Discarding interpretations in which a morphological rule is applied within an unusual part of speech. The rule that detects the replacement of t by d is a good example of this: the rule is never used in verbs starting with d. After discarding all interpretations of the word where the replacement rule has been applied and the part of speech is a verb, the number of interpretations in the analysis of the word is reduced by half.</Paragraph>
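<Paragraph> The two discarding strategies can be sketched as a single filter over interpretations. The rule names, the table contents and the record fields are invented for illustration; only the analisi/analisiz and t-by-d examples come from the text.

```python
# Minimal sketch of the Context_Based_Disambiguator idea, under assumed
# data shapes: each interpretation records the rule applied, where it
# applied (lemma/morpheme) and the part of speech. Table contents are
# hypothetical examples, not corpus-derived values.

ALLOWED_PLACES = {"z-by-s": {"lemma"}}     # never in the morpheme (high level)
FORBIDDEN_POS = {"t-by-d": {"verb"}}       # rule never applies to verbs

def context_disambiguate(interpretations):
    kept = []
    for interp in interpretations:
        rule, place, pos = interp["rule"], interp["place"], interp["pos"]
        if rule in ALLOWED_PLACES and place not in ALLOWED_PLACES[rule]:
            continue                        # strategy 1: rule in wrong place
        if pos in FORBIDDEN_POS.get(rule, ()):
            continue                        # strategy 2: rule in wrong POS
        kept.append(interp)
    return kept
```
</Paragraph>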
      <Paragraph position="4"> 2. Refine the model of the student's Interlanguage (Language_Level_Based_Disambiguator).</Paragraph>
      <Paragraph position="5"> A word can change into a quite different one as a result of the application of an excessive number of rules that represent deviant phenomena. From the study of the corpora, we can determine the exact number of deviation rules that an interpretation can plausibly involve at each language level. At the moment we have determined it for some levels (i.e. the highest) and we are working on the others.</Paragraph>
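<Paragraph> A minimal sketch of this refinement, with invented thresholds: interpretations that need more deviation rules than the maximum established for the learner's level are discarded.

```python
# Hypothetical sketch of the Language_Level_Based_Disambiguator: the
# per-level maxima below are invented for illustration, not the values
# determined from the corpora.

MAX_DEVIATION_RULES = {"high": 2, "intermediate": 3}   # assumed thresholds

def level_disambiguate(interpretations, level):
    limit = MAX_DEVIATION_RULES.get(level)
    if limit is None:
        return interpretations          # level not yet characterised
    return [i for i in interpretations if limit >= len(i["deviation_rules"])]
```
</Paragraph>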
    </Section>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 The Output of the Modelling Tool
</SectionTitle>
    <Paragraph position="0"> In this section we will show an example of an interlanguage structure in order to see the relationship between the output of the modelling tool and the description of interlanguage structures explained in the conceptual model.</Paragraph>
    <Paragraph position="1"> In this example we will see a linguistic phenomenon detected in the corpus of high level learners. The description of the phenomenon is: &amp;quot;when learners want to construct relative clauses and the last letter of the verb is t, for example dut (auxiliary verb for compound verbs), when adding the suffix -n for constructing relative clauses, the t is replaced by d and the letter a is added. That is, dut + -n = dudan&amp;quot;.</Paragraph>
    <Paragraph position="2"> e.g. Ikusi dudan haurra atsegina da. 'The boy who I have seen is nice.' (Ikusi duda_n haurra atsegina da.)</Paragraph>
    <Paragraph position="3"> Word-by-word gloss: I have seen - who - boy the - nice - is.</Paragraph>
    <Paragraph position="4"> This example shows that Basque syntactic information is found inside the word. That is why in the modelling of the LC_I linguistic condition (see the example) the REL feature (relative clause) is at word level, and not at sentence level.</Paragraph>
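<Paragraph> The rule in the quoted description can be written down literally. This toy function covers only the t-final case of the example (dut + -n = dudan); a real two-level rule is stated over phonological contexts, not whole strings.

```python
# Toy version of the described phenomenon: when the relative suffix -n
# attaches to a t-final verb form, t becomes d and the letter a appears.

def add_relative_suffix(verb):
    """Only the t-final case from the example; other endings would
    need their own rules."""
    if verb.endswith("t"):
        return verb[:-1] + "dan"    # dut -> dudan
    raise NotImplementedError("only t-final forms are illustrated here")
```
</Paragraph>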
    <Paragraph position="5"> An example of the output of the modelling tool: (... (description &amp;quot;rule applied in the lemma&amp;quot;)). In the process of modelling, first we identify the linguistic rules; second, we detect groups of linguistic rules which occur in the same context and define the linguistic phenomenon; and last, the interlanguage structure is identified.</Paragraph>
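<Paragraph> The second of the three stages just listed can be sketched as grouping detected rule applications by shared context, with the description left as a question mark for the psycholinguist, as in the output example. The rule codes and data shapes below are hypothetical.

```python
# Hypothetical sketch of stage two of the modelling process: group rule
# applications that occur in the same context into candidate linguistic
# phenomena. The description of each phenomenon stays "?" until the
# psycholinguist fills it in.

def group_phenomena(rule_applications):
    """rule_applications: iterable of (rule code, context) pairs."""
    by_context = {}
    for rule, context in rule_applications:
        by_context.setdefault(context, set()).add(rule)
    return [{"rules": sorted(rules), "context": ctx, "description": "?"}
            for ctx, rules in by_context.items()]
```
</Paragraph>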
    <Paragraph position="6"> If we compare this output with the conceptual model given in section 2, we can see that all the information needed is obtained automatically, except for the description of the linguistic phenomenon and the interlanguage structure (see the question marks in the example). Such information will be completed by the psycholinguist who will use the ICALL system.</Paragraph>
    <Paragraph position="7"> The information given by the psycholinguist will be reused in future modellings.</Paragraph>
  </Section>
</Paper>