<?xml version="1.0" standalone="yes"?>
<Paper uid="C69-1601">
  <Title>COMPUTATIONAL ANALYSIS OF INTERFERENCE PHENOMENA ON THE LEXICAL LEVEL</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
COMPUTATIONAL ANALYSIS OF INTERFERENCE PHENOMENA
ON THE LEXICAL LEVEL *)
</SectionTitle>
    <Paragraph position="0"> W. Skalmowski and M. Van Overbeke. Summary: This contribution presents the results of a comparison of Dutch texts written by bilinguals 1) (speaking French and Dutch) with Dutch texts regarded as STANDARD WRITTEN DUTCH. Attention was focused on French loan-words appearing in both types of texts and on the differences in their use. Certain generalizations as to the mechanisms of interference are suggested.</Paragraph>
    <Paragraph position="1"> 1. Materials The materials used for the present contribution belong to two groups: - Group A: texts written by francophones with ca. 6 years of Dutch training. These texts represent what we call Francophone Written Dutch (FWD below). - Group B: texts from recent Dutch literature by both Dutch and Flemish authors. They represent here Standard Written Dutch (SWD).</Paragraph>
    <Paragraph position="2"> *) We are greatly indebted to our colleagues Mr. L. DE BUSSCHERE, who prepared all the computer programs needed in this investigation, and Mr. R. EECKHOUT, who helped us with many suggestions as to the possibilities of information processing techniques and with critical remarks concerning the linguistic aspects of our problem, and, last but not least, to the Direction of the MATHEMATICAL CENTRE of the University of Louvain, who put the IBM-360 computer at our disposal.</Paragraph>
    <Paragraph position="4"> The texts of group A were written by 400 francophone 18-year-old pupils in the highest classes at the 61 private secondary schools in Brussels and its suburbs. This sample represents one fifth of the total population. From every pupil we obtained two Dutch compositions, one of them a piece of homework written in November 1967, the other an examination composition from December of the same year. The reason for this choice is that the pupils can call on the assistance of their parents and their dictionaries in the first situation but not in the second.</Paragraph>
    <Paragraph position="5"> From every composition the first 125 words were put on punchcards together with coded information as to their source. In this way a corpus of ca. 100,000 words was compiled. In order to allow for comparison of relative parameters such as wordspread, vocabulary-growth etc., it was later divided into two parts each containing ca. 50,000 words (parts 1 and 2 below).</Paragraph>
    <Paragraph position="6"> The texts of group B, i.e. the SWD, were obtained by putting together extracts from literary work by 10 contemporary authors.</Paragraph>
    <Paragraph position="7"> This anthology gave us a corpus of some 10,000 words.</Paragraph>
    <Paragraph position="8"> The first part of group A reflects ca. 50 different subject-matters, whereas the SWD-anthology reflects only 10 subject-matters or &quot;themes&quot;. So the disproportion of corpora is outweighed by a themes/tokens ratio which is 1/1,000 in both corpora (50 themes in ca. 50,000 words and 10 themes in ca. 10,000 words). In order to estimate the influence of subject-matter on word-choice and especially on the rate of vocabulary-growth, a comparison was made between the 10-author corpus and a fragment of ca. 10,000 words from one single author. The results show that the vocabulary-growth remains almost unchanged, i.e. that the diversity of subject-matters does not substantially influence the numerical values of growth rate (fig. 1). In accordance with this result, we suggested that each of these texts (groups A and B) be regarded as written by one single person.</Paragraph>
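    <Paragraph><![CDATA[
The vocabulary-growth comparison described above can be sketched as follows. This is a minimal illustration in Python, not the original IBM-360 program; the function name vocabulary_growth, the step parameter and the toy token list are assumptions made for the example. As in the paper, every distinct word form counts as its own type (no lemmatization).

```python
def vocabulary_growth(tokens, step):
    """Return (tokens_read, distinct_types) pairs every `step` tokens."""
    seen = set()
    curve = []
    for count, token in enumerate(tokens, start=1):
        seen.add(token)  # every distinct word form is its own type
        # Record a point at each checkpoint and once more at the end of the text.
        if count % step == 0 or count == len(tokens):
            curve.append((count, len(seen)))
    return curve


# Hypothetical toy text; a real run would use the punched corpus.
sample = "de kat en de hond en de vogel".split()
for tokens_read, types_seen in vocabulary_growth(sample, 4):
    print(tokens_read, types_seen)
```

Computing such a curve for the 10-author corpus and for the single-author fragment, and comparing the two point by point, corresponds to the comparison reported for fig. 1.
]]></Paragraph>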
    <Paragraph position="10"/>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Fig. 1: COMPARISON OF VOCABULARY GROWTH (10 themes vs. one single author)
</SectionTitle>
    <Paragraph position="0"/>
    <Paragraph position="2"/>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. Lexical interference
</SectionTitle>
    <Paragraph position="0"> The main purpose of this contribution is to test and verify certain non-computational insights about language interference in general. Dutch presents a very striking example of this phenomenon, since its vocabulary contains a very large number of French loan-words and foreign words and there is still an &quot;open door&quot; allowing the intrusion of lexical gallicisms in practically unlimited quantity. Thus the Dutch vocabulary holds many parallel lexemes of both origins, e.g. analyse/ontleding, fenomeen/verschijnsel, deceptie/ontgoocheling etc.</Paragraph>
    <Paragraph position="1"> This situation strongly resembles that of English with its Anglo-Saxon and Romance words, although the semantic differentiation of such word-pairs seems to have progressed much further there. Whereas the native Dutch speaker plays both keys with unbiased ease, for the Belgian francophone this ambiguous situation produces certain constraints and difficulties, which have visible effects on word-choice, growth rate of foreign words and vocabulary size in general.</Paragraph>
    <Paragraph position="2"> For reasons of simplicity our investigation did not adopt the usual distinction between loan-words and foreign words since this is based on the different degrees of integration of foreign lexemes, measured by differences in pronunciation, social acceptability within the speaking community and certain prescriptive arrangements such as their inclusion in vocabularies and dictionaries, whose authority is generally accepted. As the aim of our investigation was to find ways of providing numerical values for interference phenomena, we proceeded in a purely descriptive way, using only etymological criteria to distinguish between original and foreign lexical elements.</Paragraph>
    <Paragraph position="3"> Thus we considered units containing foreign lexical or morphological elements, or both, as loan-words. So bonjouren with its French lexical element was entered, but also trotseren because of its French word-formational part. Composita containing only one foreign element (e.g. avondtoilet) were treated as loan-words unless this element had already been entered as an autonomous word. No distinction was made between foreign words included in the Standard Dutch Vocabulary of van Dale (e.g.</Paragraph>
    <Paragraph position="4"> assaut) and those which are not mentioned there (e.g. auberge), both examples occurring in our investigation materials. Since the computer program did not provide for lemma-like items, all different morphological forms and derivations of words were regarded as different types; thus expressie, expressief, expressionist etc. are counted as different items. Also for reasons of simplicity all non-French foreign words are relegated here to the category of pure Dutch items.</Paragraph>
    <Paragraph position="5"> 3. Lexical interference and word-length As a first approximation, the percentage of foreign words in the vocabulary of both FWD- and SWD-texts was established.</Paragraph>
    <Paragraph position="6"> The results are as follows :</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="8" type="metho">
    <SectionTitle>
TOKENS TYPES log TYPES
</SectionTitle>
    <Paragraph position="0"> 141 0.4285 5.38 The difference in the foreign vocabulary ratio between the two groups results in distributional differences of words of diverging letter-number. Though the overall word-length of tokens in both groups is nearly identical (4.51 for SWD and 4.61 for FWD), an application of the chi-square test proved the divergences of word distribution (words belonging to different wordclasses) to be highly significant. The average word-length of types (M) is different in both groups:</Paragraph>
    <Paragraph position="2"> As the pronunciation of French words is in most cases adapted to Dutch (and this is reflected in the orthography), it was not plausible to suppose that this divergence was due only to the proportional difference of foreign words. It was found that the divergence was partly due to the use of composita in FWD; their distribution differs considerably from SWD. This is strikingly evident for word-length 10 (fig. 2). The fact that FWD-authors would &quot;switch in&quot; this Dutch word-formational device in cases where the Dutch native speaker does not, shows that francophones are &quot;over-aware&quot; of this means of translating the French genitive construction by a Dutch compositum (e.g. pot de fleurs &gt; bloempot). This fact strengthens the assumption made in this paper that the lexical level of language is very closely connected with higher (syntactic) levels, so that statistically statable facts may be explained only in connection with certain more general models of speech production.</Paragraph>
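    <Paragraph><![CDATA[
The word-length comparison above rests on two computations: the mean length of tokens and of types, and a chi-square test on the distribution of words over length classes. The sketch below shows one way to compute both in Python. It is an illustration under stated assumptions, not the authors' program: the function names and the two toy texts are hypothetical, and the statistic is the ordinary Pearson chi-square for a 2 x k contingency table, which the paper does not specify further.

```python
from collections import Counter


def mean_length(words):
    """Average number of letters per word in the given collection."""
    return sum(len(w) for w in words) / len(words)


def chi_square(counts_a, counts_b):
    """Pearson chi-square statistic for a 2 x k table of length counts.

    counts_a and counts_b are Counters mapping word-length to frequency.
    Returns the statistic and its degrees of freedom (k - 1).
    """
    lengths = sorted(set(counts_a) | set(counts_b))
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    grand = total_a + total_b
    stat = 0.0
    for length in lengths:
        column = counts_a[length] + counts_b[length]
        for observed, row_total in ((counts_a[length], total_a),
                                    (counts_b[length], total_b)):
            expected = row_total * column / grand
            stat += (observed - expected) ** 2 / expected
    return stat, len(lengths) - 1


# Hypothetical toy texts standing in for the FWD and SWD corpora.
fwd = "de bloempot staat op de tafel".split()
swd = "de pot staat op de grote tafel".split()
print(mean_length(fwd), mean_length(set(fwd)))  # tokens vs. types
print(chi_square(Counter(map(len, fwd)), Counter(map(len, swd))))
```

The statistic would then be compared with the chi-square critical value for the returned degrees of freedom; with toy texts of this size the expected counts are far too small for the test to be meaningful.
]]></Paragraph>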
    <Paragraph position="3">  The interference model presented here consists of two parts: the syntactic one, containing also the word-formational devices, which may be thought of as a generative device of the kind described by N. CHOMSKY and other generativists; the second part, called the lexical morpheme store, is thought of as consisting of entries &quot;written down&quot; in terms of conceptual symbols, provided with actual linguistic interpretations. These &quot;interpretations&quot;, which in a very simplified manner may be identified with words tout court, are picked out of the store and &quot;fitted&quot; into previously constructed sentence forms. In other words, we assume that the sentences are formed according to semantic requirements before the actual words have been chosen. This last routine goes on in a semi-automatic way, which may be visualized as picking the required lexemes, according to the entries in terms of conceptual symbols, out of a magnetic tape gliding under a reading device of some sort.</Paragraph>
    <Paragraph position="4"> For the case of a bilingual speaker, we can imagine the procedure as a tape with three different tracks, the middle one containing the &quot;entries&quot;, the other two the respective actual morphemes, in casu Dutch and French (D and F in fig. 3). Speaking in one of the two languages demands a switch-over to one of the external tracks. It may be assumed that, in the case of a monolingual Dutch speaker, the cells contain the parallel French and Dutch words in an unordered manner, whereas with a francophone a bias exists towards the French loan-word (e.g. column 1 in fig. 3: phénomène &gt; fenomeen (verschijnsel)). This explains the predilection for loan-words even within the limits of the &quot;basic vocabulary&quot;, and the more so with words of low frequency. Other variants of speech production behavior are possible: for instance the hypercorrect option shown in column 2, where the speaker consciously reaches for the more distant lexeme, and the case of pure borrowing, which may be conceived of as an automatic switch-over to the French side wherever the Dutch track is blank or whenever the bilingual's competence fails to furnish a good Dutch word or synonym. In this process the French lexeme is placed in the cell on the Dutch side (cf. column 3, where the Dutch word is lacking).</Paragraph>
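    <Paragraph><![CDATA[
The three-track tape can be made concrete with a small data-structure sketch. The Python below is only an illustration of the selection behavior described above, under simplifying assumptions: the class Cell, the function pick_word and the speaker labels are hypothetical names, the native speaker's "unordered" choice is reduced to a fixed preference, and the hypercorrect option is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cell:
    """One column of the tape: a concept plus its realizations."""
    concept: str                  # entry on the middle track
    dutch_native: Optional[str]   # native Dutch lexeme, None if lacking
    dutch_loan: Optional[str]     # French loan-word established in Dutch
    french: str                   # lexeme on the French track


def pick_word(cell, speaker):
    """Choose a Dutch-track word for `speaker` ('native' or 'francophone')."""
    if speaker == "francophone":
        # Assumed bias: the loan-word lies closest to the French track.
        candidates = [cell.dutch_loan, cell.dutch_native]
    else:
        # Simplification: the native speaker's choice is called unordered
        # in the text; here the native word is simply tried first.
        candidates = [cell.dutch_native, cell.dutch_loan]
    for word in candidates:
        if word is not None:
            return word
    # Blank Dutch track: pure borrowing, the French lexeme fills the cell.
    return cell.french


phenomenon = Cell("phenomenon", "verschijnsel", "fenomeen", "phénomène")
inn = Cell("inn", None, None, "auberge")  # hypothetical blank Dutch track
print(pick_word(phenomenon, "francophone"), pick_word(phenomenon, "native"))
print(pick_word(inn, "francophone"))
```

The last call shows blank-filling: with nothing available on the Dutch track, the French lexeme is returned unchanged.
]]></Paragraph>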
    <Paragraph position="6"> We assume that the word-formational rules belong to the syntactical part. Thus the reshaping of new French borrowings (cf. the loan-adjective gebagancerd composed of the French bagan~, whose counterpart is lacking in the Dutch track, and of two Dutch affixes ge- and -d) is done in the grammatical part of our model. As a matter of fact, this assumption is a heuristic over-simplification, because certain grammatical morphemes are in fact borrowed, cf. the endings -eren, -atie, -age etc. In order to explain this phenomenon, one could argue from the fact that in many cases whole word-items are introduced to the lexical store and activate the analogy mechanism, but this problem would lead us beyond the scope of the present investigation.</Paragraph>
    <Paragraph position="7"> A code-switching theory There has been much speculation about the possible principle of lexeme order in the store, some ordering being a necessary condition of efficient re-coding. Much discussion, too, has been devoted to the so-called ZIPF-law 3). The most convincing explanation was that suggested by HERDAN 4), namely that an ordering of items by decreasing frequency would diminish the number of operations necessary to identify a given item. &quot;Let us ... assume that the arrangement of the entries is systematic according to frequency of occurrence in descending order of frequency, so that the most frequent word has rank 1, the second most frequent word rank 2, and so on. If in such a dictionary, that is one in which words are arranged in order of decreasing frequency and increasing order of rank, the look-up procedure is one of successive comparison, the word of rank r will require r look-up operations, and since this word occurs - the Zipf-law assumed - C/r times, the total number of look-up operations required to locate a word is C (the constant in the Zipf-law, formulated as $r \cdot f_r = C$). Thus for n words contained in the dictionary, nC look-up operations will be required. On the other hand, we know that for the Zipf-law the total number of occurrences (the text length in terms of word number) and thus the total number of words to be searched, is given by $N = \int_1^n (C/r)\,dr = C \log n$. It follows that the average number of look-up operations per word is $A_n = nC / (C \log n) = n / \log n$ (...) This compares favourably with the n/2 look-up operations which would be needed under the scheme described above, which makes no use of the frequency element.&quot; Within the framework of our model it would mean that the winding and unwinding of the tape takes considerably less time than in the case of wholly random distribution. The question remains of what principle underlies the differentiation of item probability.
Here too, the concept of &quot;pigeon-holing&quot; of semantic information proposed by HERDAN 4) seems to be the most plausible. In other words, the &quot;conceptual symbols&quot; do not represent separate pieces of the univers de discours taken at random, but are probably ordered by some classificational system, resembling the biological classification.</Paragraph>
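    <Paragraph><![CDATA[
HERDAN's look-up argument can be checked numerically. The sketch below, in Python, replaces the integral of C/r by the exact discrete sum over ranks 1 to n, so its result is slightly smaller than n / log n; the function name average_lookups_zipf is an assumption for this example, and the constant C is omitted because it cancels in the ratio.

```python
import math


def average_lookups_zipf(n):
    """Mean number of sequential look-ups per running word when n entries
    are ranked by frequency and the word of rank r occurs C / r times.

    Total look-ups are sum(r * C / r) = n * C and total occurrences are
    C * sum(1 / r), so C cancels and only the harmonic sum remains.
    """
    harmonic = sum(1.0 / r for r in range(1, n + 1))
    return n / harmonic


for n in (1_000, 10_000):
    # discrete average, continuous approximation n / log n, unordered n / 2
    print(n, round(average_lookups_zipf(n)), round(n / math.log(n)), n // 2)
```

For any sizeable n both frequency-ordered figures lie far below n/2, which is the point of the quoted passage.
]]></Paragraph>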
  </Section>
  <Section position="5" start_page="8" end_page="8" type="metho">
    <SectionTitle>
6. Word content and entropy
</SectionTitle>
    <Paragraph position="0"> To test this hypothesis we divided the FWD material into three frequency-classes (group I: absolute frequency 1; group II: frequencies 2 and 3; group III: frequency above 3) and examined samples of these groups according to their distribution within the classificational system applied by L. BROUWERS in his Dutch thesaurus HET JUISTE WOORD 3). The supposition was that in the event of ordering of some kind, the distribution of items among the &quot;content classes&quot; in the thesaurus (expressed as entropy and redundancy) would be different for the various frequency groups, and further, that in the event of the &quot;pigeon-holing&quot; suggested by HERDAN, the redundancy should increase for groups of items with higher frequencies. Such an increase was in fact observed, as the reader can conclude from the following table: Thus it seems that some &quot;natural order&quot;, reflecting a classification of concepts according to their content, is at least one of the causes differentiating the relative frequencies of words. This result is compatible with the fact that the diversity of subject-matter (cf. section 1) does not considerably alter the growth rate of vocabulary. This statement need not rule out other devices allowing quick interconnections between words belonging to the same content-group but differing in frequency (cf. the so-called association of related concepts suggested by P. A. KOLERS). However, the basic principle of order seems to be of a statistical kind, as is shown by the close fit of the rank-frequency distribution with the theoretical distribution according to the ZIPF-MANDELBROT formulation (cf. fig. 4). The correlation coefficient between the observed and the theoretical distribution is 0.993.</Paragraph>
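    <Paragraph><![CDATA[
Entropy and redundancy of a distribution over content classes can be computed as in the following Python sketch. It assumes the usual Shannon definitions (entropy in bits, redundancy as 1 - H / H_max with H_max the logarithm of the number of classes); the paper does not spell out its formulas, and the function names and counts below are illustrative only, not the study's data.

```python
import math


def entropy_bits(counts):
    """Shannon entropy (in bits) of a distribution given as class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def redundancy(counts):
    """Relative redundancy 1 - H / H_max, with H_max = log2(number of classes).

    Requires at least two classes; empty classes count towards H_max.
    """
    h_max = math.log2(len(counts))
    return 1.0 - entropy_bits(counts) / h_max


# Hypothetical counts of items per content class for two frequency groups.
spread_out = [4, 3, 2, 1]
concentrated = [8, 1, 1, 0]
print(redundancy(spread_out), redundancy(concentrated))
```

A group whose items cluster in few content classes yields the higher redundancy, which is the pattern the hypothesis predicts for the higher-frequency groups.
]]></Paragraph>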
  </Section>
  <Section position="6" start_page="8" end_page="13" type="metho">
    <SectionTitle>
7. Consequences
</SectionTitle>
    <Paragraph position="0"> The assumed model has consequences, which have been empirically tested: 1. The assumed model, and especially the process of blank-filling of the Dutch track with French morphemes, presupposes that in general the FWD-writers will use a greater number of foreign words than the SWD allows. This fact is already apparent from the overall percentage of foreign elements in FWD (cf. fig. 5). In particular the foreign words should appear more frequently in proportion to the increase of text-length. The investigation of vocabulary growth rate has in fact shown that this is the case: the ratio of new foreign words to the total vocabulary remains stable (ca. 1/10) until a vocabulary of 3,000 items is reached. Thereafter it increases considerably. The sample described as Part 2 (fig. 5), containing ca. 50,000 words, has not been pre-edited; i.e. no orthographic mistakes or omissions have been eliminated, as was done manually in Part 1. Thus all orthographic idiosyncrasies have been counted as new types by the computer. We assume that the difference in the size of the so-called basic vocabulary (3,000 vs. 3,500) is mainly due to this fact.</Paragraph>
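    <Paragraph><![CDATA[
One way to trace how the share of foreign words changes as the vocabulary grows is sketched below in Python. The paper does not define the ratio precisely, so this sketch assumes it is the proportion of foreign types among each successive block of newly encountered types; the function name, the checkpoint parameter and the toy data are all hypothetical.

```python
def foreign_share_of_new_types(tokens, foreign_types, checkpoint):
    """For each block of `checkpoint` newly seen types, return
    (vocabulary_size, share of foreign types within that block)."""
    seen = set()
    foreign_in_block = 0
    shares = []
    for token in tokens:
        if token in seen:
            continue  # repeated tokens add nothing to the vocabulary
        seen.add(token)
        if token in foreign_types:
            foreign_in_block += 1
        if len(seen) % checkpoint == 0:
            shares.append((len(seen), foreign_in_block / checkpoint))
            foreign_in_block = 0
    return shares


# Toy word list, not a real sentence; foreign types marked by hand.
tokens = "de analyse is de kat en fenomeen deceptie".split()
foreign = {"analyse", "fenomeen", "deceptie"}
for size, share in foreign_share_of_new_types(tokens, foreign, 2):
    print(size, share)
```

On a full corpus, a share that stays near 1/10 up to a vocabulary of about 3,000 types and rises afterwards would reproduce the pattern reported above. A trailing incomplete block is not reported.
]]></Paragraph>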
  </Section>
  <Section position="7" start_page="13" end_page="13" type="metho">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> 2. As the choice of lexemes from the store takes place in terms of &quot;conceptual symbols&quot;, the lexical diversity should not be substantially diminished on account of the limited vocabulary. The blank-fillings with French lexemes should allow the francophones to keep the overall diversity at a normal level, i.e. at that of the SWD-writers. In other words, we suppose that the greater number of foreign elements in FWD-texts is the consequence of the endeavor to &quot;keep pace&quot; with the normal rate of language diversity.</Paragraph>
  </Section>
</Paper>