Proceedings of the Workshop on Linguistic Distances, pages 82–90,
Sydney, July 2006. c©2006 Association for Computational Linguistics
A Measure of Aggregate Syntactic Distance
John Nerbonne and Wybo Wiersma
Alfa-informatica, University of Groningen
P.O.Box 716, NL 9700 AS Groningen, The Netherlands
j.nerbonne@rug.nl & wybo@logilogi.org
Abstract
We compare vectors containing counts of
trigrams of part-of-speech (POS) tags in
order to obtain an aggregate measure of
syntax difference. Since lexical syntactic
categories reflect more abstract syntax as
well, we argue that this procedure reflects
more than just the basic syntactic cate-
gories. We tag the material automatically
and analyze the frequency vectors for POS
trigrams using a permutation test. A test
analysis of a 305,000 word corpus con-
taining the English of Finnish emigrants
to Australia is promising in that the proce-
dure proposed works well in distinguish-
ing two different groups (adult vs. child
emigrants) and also in highlighting syntac-
tic deviations between the two groups.
1 Introduction
Language contact is a common phenomenon
which may even be growing due to the increased
mobility of recent years. It is also linguistically
significant, since contact effects are prominent
in linguistic structure and well-recognized con-
founders in the task of historical reconstruction.
Nonetheless we seem to have no way of assaying
the aggregate affects of contacts, as Weinreich fa-
mously noted:
“No easy way of measuring or charac-
terizing the total impact of one language
on another in the speech of bilinguals
has been, or probably can be devised.
The only possible procedure is to de-
scribe the various forms of interference
and to tabulate their frequency.” (Wein-
reich, 1953, p. 63)
This paper proposes a technique for measuring the
aggregate degree of syntactic difference between
two varieties. We shall thus attempt to measure
the “total impact” in Weinreich’s sense, albeit with
respect to a single linguistic level, syntax.
If such a measure could be developed, it would
be important not only in the study of language con-
tact, but also in the study of second-language ac-
quisition. A numerical measure of syntactic dif-
ference would enable these fields to look afresh at
issues such as the time course of second-language
acquisition, the relative importance of factors in-
fluencing the degree of difference such as the
mother tongue of the speakers, other languages
they know, the length and time of their experience
in the second language, the role of formal instruc-
tion, etc. It would make the data of such studies
amenable to the more powerful statistical analysis
reserved for numerical data.
Naturally we want more than a measure which
simply assigns a numerical value to the difference
between two syntactic varieties: we want to be
able to examine the sources of the difference both
in order to win confidence in the measure, but also
to answer linguistic questions about the relative
stability/volatility of syntactic structures.
1.1 Related Work
Thomason and Kaufmann (1988) and van Coet-
sem (1988) noted, nearly simultaneously, that the
most radical (structural) effects in language con-
tact situations are to be found in the language of
SWITCHERS, i.e., in the language used as a second
or later language. People MAINTAINING their lan-
guage tend to adopt new lexical items from a con-
tact language, but this only has structural conse-
quences as the lexical items accumulate. Thus we
hear radically different English used in immigrant
82
communities in the English-speaking world, but
the natives in contact with these groups do not tend
to modify their language a great deal. This sug-
gests that we should concentrate on those switch-
ing as we begin to develop measures of aggregate
difference.
Poplack and Sankoff (1984) introduced tech-
niques for studying lexical borrowing and its
phonological effects, and and Poplack, Sankoff
and Miller (1988) went on to exploit these ad-
vances in order to investigate the social conditions
in which contact effects flourish best.
We follow Aarts and Granger (1998) most
closely, who suggest focusing on tag sequences in
learner corpora, just as we do. We shall add to
their suggest a means of measuring the aggregate
difference between two varieties, and show how
we can test whether that difference is statistically
significant.
2 Syntactic Footprints
In this section we justify using frequency profiles
of trigrams of part-of-speech (POS) categories as
indicators of syntactic differences. We shall first
automatically tag second-language speakers’ cor-
pora with syntactic categories:
Oh that ’s a just a
INT PRON COP ART EXCL ART
fun in a ’ Helsinki
N-COM PREP ART PAUSE N-PROP
We then collect these into overlapping triples (tri-
grams). The tag-trigrams include triples such as
INT-PRON-COP and PRON-COP-ART.
We consider three possible objections to pro-
ceeding this way. First, one might object that un-
igrams, bigrams, also should be compared. We
are in fact sympathetic to the criticism that n-
grams for n negationslash= 3 should also be compared, at
least with an eye toward refining the technique,
and we have performed experiments with bigrams
and with combinations ofn-grams for largern, but
we restrict the discussion here to trigrams in or-
der to simplify presentation. Second, our choice
of part-of-speech categories may bias the results,
since other research might use other POS cate-
gories, and third, that POS trigrams do not reflect
syntax completely. We first develop these last two
objections further, and then explain why it is rea-
sonable to proceed this way.
Ideally we should like to have at our disposal
the syntactic equivalent of an international pho-
netic alphabet (IPA, 1949), i.e. an accepted means
of noting (an interesting level of) syntactic struc-
ture for which there was reasonable scientific con-
sensus. But no such system exists. Moreover,
the ideal system would necessarily reflect the hi-
erarchical structure of dependency found in all
contemporary theories of syntax, whether directly
based on dependencies or indirectly reflected in
constituent structure. Since it is unlikely that re-
searchers will take the time to hand-annotate large
amounts of data, meaning we shall need automat-
ically annotated data, this leads to a second prob-
lem, viz., that our parsers, the automatic data an-
notators capable of full annotation, are not yet ro-
bust enough for this task. (Even the best score
only about 90% per constituent on edited news-
paper prose.)
We have no solution to the problem of the miss-
ing consensual annotation system, but we wish
to press on, since it will be sufficient if we can
provide a measure which correlates strongly with
syntactic differences. We note that natural lan-
guage processing work on tagging has compared
different tag sets, noting primarily the obvious,
that larger sets result in lower accuracy (Manning
and Sch¨utze, 1999, 372ff.). Since we aim here
to contribute to the study of language contact and
second-language learning, we shall choose a lin-
guistically sensitive set, that is, a large set de-
signed by linguists. We have not experimented
with different tagsets.
With regard to the second objection, the fact
that syntax concerns more than POS trigrams, we
wish to deny that this is a genuine problem for
the development of a measure of difference. We
note that our situation in measuring syntactic dif-
ferences is similar to other situations in which ef-
fective measures have been established. For ex-
ample, even though researchers in first language
acquisition are very aware that syntactic devel-
opment is reflected in the number of categories,
and rules and/or constructions used, the degree
to which principles of agreement and government
are respected, the fidelity to adult word order pat-
terns, etc., still they are in large agreement that
the very simple MEAN LENGTH OF UTTERANCE
(MLU) is an excellent measure of syntactic matu-
rity (Ritchie and Bhatia, 1998). Similarly, life ex-
pectancy and infant mortality rates are considered
reliable indications of health when large popula-
tions are compared. We therefore continue, pos-
tulating that the measure we propose will corre-
83
late with syntactic differences as a whole, even if
it does not measure them directly.
In fact we can be rather optimistic about us-
ing POS trigrams given the consensus in syntac-
tic theory that a great deal of hierarchical struc-
ture is predictable given the knowledge of lexical
categories, in particular given the lexical HEAD.
Sells (1982, §§ 2.2, 5.3, 4.1) demonstrates that
this was common to theories in the 1980’s (Gov-
ernment and Binding theory, Generalized Phrase
Structure Grammar, and Lexical Function Gram-
mar), and the situation has changed little in the
successor theories (Minimalism and Head-Driven
Phrase Structure Grammar). There is, on the other
hand, consensus that the very strict lexicalism
which Sells’s work sketched must be relaxed in
favor of “constructionalism” (Fillmore and Kay,
1999), but even in such theories syntactic heads
have a privileged, albeit less dominant status.1
Let us further note that the focus on POS tri-
grams is poised to identify not only deviant syn-
tactic uses, such as the one given as an exam-
ple above, but also overuse and under-use of lin-
guistic structure, whose importance is empha-
sized by researchers on second-language acquisi-
tion (Coseriu, 1970), (de Bot et al., 2005, A3,B3).
According to these experts it is misleading to
consider only errors, as second language learn-
ers likewise tend to overuse certain possibilities
and tend to avoid (and therefore underuse) oth-
ers. For example, Bot et al. (2005) suggest that
non-transparent constructions are systematically
avoided even by very good second-language learn-
ers).
2.1 Tagging
We tagged the material using Thorsten Brants’s
Trigrams ’n Tags (TnT) tagger, a hidden Markov
model tagger which has performed at state-of-
the-art levels in organized comparisons, achieving
96.7% correct on the material of the Penn Tree-
bank (Brants, 2000).
Since our material is spoken English (see be-
low), we trained the tagger on the spoken part of
the International Corpus of English (ICE) from
Great Britain, which consists of 500k words. This
was suboptimal, as the material we wished to ana-
lyze was the English of Finnish emigrants to Aus-
tralia, but we were unable to acquire sufficient
1One referee suggested that one might test the association
between POS trigram differences and head differences exper-
imentally, and we find this suggestion sensible.
Australian material.
We used the tagset of the TOSCA-ICE consist-
ing of 270 tags (Garside et al., 1997), of which 75
were never instantiated in our material. In a sam-
ple of 1,000 words we found that the tagger was
correct for 87% of words, 74% of the bigrams, and
65% of the trigrams. As will be obvious in the
presentation of the material (below), it is free con-
versation with pervasive foreign influence. We at-
tribute the low tagging accuracy to the roughness
of the material. It is clear that our procedure would
improve in accuracy from a more accurate tagger,
which would, in turn, allow application to smaller
corpora.
We collect the material into a frequency vector
containing the counts of 13,784 different POS tri-
grams, one vector for each of the two sub-corpora
which we describe below. We then ask whether
the material in the one sub-corpus differs signifi-
cantly from that in the other. We turn now to that
topic.
3 Permutation Tests
There is no convenient test we can apply to check
whether the differences between vectors contain-
ing 13,784 elements are statistically significant,
nor how significant the differences are. Fortu-
nately, we may turn to permutation tests in this
situation (Good, 1995), more specifically a per-
mutation test using a Monte Carlo technique.
Kessler (2001) contains an informal introduction
for an application within linguistics.
The fundamental idea in a permutation test is
very simple: we measure the difference between
two sets in some convenient fashion, obtaining
δ(A,B). We then extract two sets at random
from A∪B, calling these A1,B1, and we calcu-
late the difference between these two in the same
fashion, δ(A1,B1), recording the number of times
δ(A1,B1) ≥ δ(A,B), i.e., how often two ran-
domly selected subsets from the entire set of ob-
servations are at least as different as (usually more
different than) the original sets were. If we repeat
this process, say, 10,000 times, thenn, the number
of times we obtain more extreme differences, al-
lows us to calculate how strongly the original two
sets differ from a chance division with respect to
δ. In that case we may conclude that if the two
sets were not genuinely different, then the origi-
nal division into A and B was likely to the degree
of p = n/10,000. In more standard hypothesis-
84
testing terms, this is the p-value with which we
may reject (or retain) the null hypothesis that there
is no relevant difference in the two sets.
We would like to guard against three dangers in
our calculations. First, given the ease with which
large corpora are obtained, we are uninterested
in obtaining statistical significance through sheer
corpus size. We aim therefore at obtaining a mea-
sure that is sensitive only to relative frequency, and
not at all to absolute frequency (Agresti, 1996).
Permutation tests effectively guard against this
danger, if one takes care to judge samples of the
same size within the permutations.
Second, we are mindful of a potential confound-
ing factor, viz., the syntactical intra-dependence
found within sentences (especially between ad-
joining POS trigrams). If we permuted n-grams,
we might in part measure the internal coherence of
the two initial sub-corpora, i.e., the coherence due
to the fact that both sub-corpora use language con-
forming to the rules of English syntax. If we per-
muted n-grams, this coherence would be lost, and
the measurement of difference would be affected.
In the terminology of permutation statistics: the
elements that are permuted must be reasonably in-
dependent. So we shall permute not n-grams, but
rather entire sentences.
Third, the decision to permute sentences rather
than n-grams exposes us to a confound due to sys-
tematically different sentence lengths. While the
result of permuting elements in a Monte Carlo
fashion always results in two sub-corpora that
have the same number of elements as in the base-
case, our problem is that the elements we per-
mute are sentences, while what we measure are
n-grams. Now if the original two sub-corpora dif-
fer substantially in average sentence length, then
the result of the Monte Carlo “shuffling” will not
be similar to the original split with respect to the
number of n-grams involved. The original sub-
corpus with longer sentences will therefore have
many more n-grams in the base-case than in the
random re-drawings from the combining corpora,
at least on average. We address this danger sys-
tematically in the subsection below on within-
permutation normalizations (§3.2).
We note a more subtle dependency we do not
attempt to guard against. Some POS sequences
(almost) only occur in relatively long sentences,
e.g. the inversion that occurs in some condition-
als Were I in any doubt, I should not .... Perhaps
English subjunctives in general occur only in rel-
atively long sentences. If this sort of structure
occurs in one variety more frequently than in an-
other, that is a genuine difference, but it might still
be the reflection of the simpler difference in sen-
tence length. One might then think that the second
variety would show the same syntax if only it had
longer sentences. As far as they are to be con-
sidered a problem in the first place, differences in
syntax that are related to sentence length cannot
be removed by (our) normalizations.
Permutation tests are a very suitable tool for
finding significant syntactical differences, and for
finding the POS trigrams that make a significant
contribution to this difference.
3.1 Measuring Vector Differences
The choice of vector difference measure, e.g. co-
sine vs. χ2, does not affect the proposed tech-
nique greatly, and alternative measures can be
used straightforwardly. Accordingly, we have
worked with both cosine and two measures in-
spired by the RECURRENCE (R) metric introduced
by Kessler (Kessler, 2001, 157ff). Following
Kessler, we also call our measures R and Rsq.
The advantage of the R and Rsq metrics is that
they are transparently interpretable as simple ag-
gregates, meaning that one may easily see how
much each trigram contributes to the overall cor-
pus difference. We even used them to calculate a
separate p-value per trigram.
Our R is calculated as the sum of the differ-
ences of each cell with respect to the average for
that cell. If we have collected our data into two
vectors (c, cprime), and if i is the index of a POS tri-
gram,Rfor each of these two vector cells is equal,
as it is defined simply as R = summationtexti|ci−ci|, with
ci = (ci + cprimei)/2. The Rsq measure attributes
more weight to a few large differences than to
many small ones, and it is calculated: Rsq =summationtext
i(ci−ci)2, with ci being the same as above (for
R).
3.2 Within-Permutation Normalization
Each measurement of difference—whether the
difference is between the original two samples
or between two samples which arise through
permutations—is taken over the collection of POS
trigram frequencies once these have been normal-
ized. We describe first the normalization that is re-
quired to cope with differences in sentence length
85
which we call WITHIN-PERMUTATION NORMAL-
IZATION, as it is applied within each permutation.
In case sub-corpora differ in sentence length,
they will automatically differ in the number of n-
grams across permutations as well. Our Monte
Carlo choice of alternatives does not change the
relative number of sentences across permutations,
but the number of POS trigrams in the groups will
vary if no normalization is applied. Longer sen-
tences give rise to larger numbers of POS trigrams
per sentence, and therefore per sub-corpora. Ap-
plying the within-permutation normalization one
or more times ensures that this does not infect the
measurement of difference.
Protecting the measurement from sensitivity to
differing numbers of POS trigrams per sentence
is for us sufficient reason to normalize, but we
also normalize in order to facilitate interpretation.
We return to this below, in the definition of the
rescaled vectors sy,so.
We thus collect from the tagger a sequence of
counts ci of tag trigrams for each sample. We
treat only the case of comparing two samples here,
which we shall refer to as young (y) and old (o) for
reasons which will become clear in the following
section. We shall keep track of the sum-per-tag tri-
gram as well, summing over the two sub-corpora.
cy = <cy1,cy2,...,cyn > Ny = summationtextni=1cyi
+co = <co1,co2,...,con > No = summationtextni=1coi
c = <c1,c2,...,cn > N(= Ny +No)
= summationtextni=1ci
As a first step in normalization, we work with
vectors holding the relative frequency fractions
per group:
fy = <...,fyi (= cyi/Ny),...>
fo = <...,foi (= coi/No),...>
We note that summationtextni=1fyi = summationtextni=1foi = 1.
We then compute the relative proportions per
trigram, comparing now across the groups. This
prepares for the step which redistributes the raw
trigram counts to compensate for differences in
sentence length.
py = <...,pyi(= fyi /(fyi +foi )),...>
po = <...,poi(= foi/(fyi +foi )),...>
We might also define a sum of py + po:
p = <...,pi(= (poi +pyi) = 1),...>
We do not actually use p below, only py and po,
but we mention it for the sake of the check it al-
lows that pyi +poi = 1,∀i.
We then re-introduce the raw frequencies per
category to obtain the normalized, redistributed
counts Cyn,Con. Note that we use the total count
of the trigram in both samples to redistribute (thus
redistributing these counts based on the trigram to-
tals in both samples):
Cyn = <...,pyi ·ci,...>
Con = <...,poi ·ci,...>
Up to this point the normalization has corrected
for differences in sentence length, or to be more
precise, for differences in the numbers of n-grams
which may appear as a result of permuting sen-
tences. For larger numbers of trigrams the situ-
ation will become: Ny = summationtextni=1cyi ≈ summationtextni=1 Cyi
so that we have effectively neutralized the in-
crease or decrease in the number ofn-grams which
might have arisen due to sentence length. With-
out this normalization a skew in sentence length
in the base case would cause changed, in the
worst case increased, and perhaps even extreme,
significance. During random permutation, where
longer sentences will tend to be distributed more
evenly between the sub-corpora, a disproportion-
ately larger number of n-grams would be found in
the sub-corpus corresponding to the base corpus
with shorter sentences. We have now normalized
so that that effect will no longer appear.
We illustrate the normalizations up to this point
in Table 1. We see already that the overall effect
is to shift mass to the smaller sample. Notice also
that if we were to define C = Cy + Co, then
C = c, since Cy and Co are a redistribution of
c using py and po, whose sum p is 1 under all cir-
cumstances, as was noted above. At the same time
cy negationslash= Cy and co negationslash= Co (if there were differences
in sentence lengths). The values obtained at this
point may be measured by the vector comparison
measure (cosine or R(sq)).
We use this redistributing normalization instead
of just the relative frequency because using rel-
ative frequency would cause trigrams occurring
mainly and frequently in the short-sentence group
to become extremely significant. This is especially
86
Group y Group o Group y’ Group o’
T1 T2 T1 T2 T1 T2 T1 T2
counts c 15 10 90 10 10 10 17 0
rel. freq. f 0.6 0.4 0.9 0.1 0.5 0.5 1 0
norm. prop. p 0.4 0.8 0.6 0.2 0.33 1 0.67 0
trigram ci 105 20 105 20 27 10 27 10
redistrib. C 42 16 63 4 9 10 18 0
Table 1: Two examples of the normalizations applied before each measurement of vector difference.
On the left groups y and o are compared on the basis of the two trigrams T1 and T2. The counts are
shown in the first row, then relative frequencies (within the group), normalized relative proportions, and
finally redistributed normalized counts. The two numbers in boldface in the ‘count’ line are compared
to calculate the underlined relative frequency (on the left) in the ‘relative frequency’ line (in general
counts are compared within groups to obtain relative frequencies). Next, the two underlined fractions of
the ‘relative frequency’ row are compared to obtain the corresponding fractions (immediately below) of
the ‘normalized proportions’ row. Thus relative frequencies are compared across groups (sub-corpora)
to obtain the relative proportions. The trigram count row shows the counts per trigram type, and the
‘redistributed’ row is simply the product of the last two. The second example (on the right) demonstrates
that missing data finds no compensation in this procedure (although we might experiment with smoothing
in the future).
distorting if one calculates the per trigram type p-
value (R or Rsq for a single i).
The normalization does not eliminate all the ir-
relevant effects of differing sentence lengths. To
obtain further precision we iterate the steps above
a few times, re-applying the normalization to its
own output. We are motivated to iterate the pro-
cedure for the following reason. If a trigram is
relatively more frequent in the smaller sub-corpus,
it must then also be relatively less frequent within
the entire corpus (less frequent within the two sub-
corpora together), so there is less frequency mass
to re-distribute for these trigrams than for trigrams
that are relatively more frequent in the larger sub-
corpus (those will be more frequent within the en-
tire corpus). A special case of this are n-grams
that occur only in one sub-corpus. If they oc-
cur only in the larger sub-corpus then their mass
will never be re-distributed in the direction of the
smaller sub-corpus, since zero-frequencies within
one sub-corpus will always result in zero relative
weight (in the current set-up).2 This means that af-
ter normalization the larger sub-corpus will always
still be a bit larger than the smaller one. After one
normalization the effect of these factors is small,
but we can reduce it yet further by iterating the
normalization. This is worthwhile since we wish
2Alternatively, we might have explored a Good-Turing
estimation of unseen items (Manning and Sch¨utze, 1999,
p. 212).
to be certain. After five iterations the relative size-
difference between our normalized sub-corpora is
less than 0.1% for trigrams of the full ICE-tagset
(and even a thousand times smaller for the reduced
tagset). We regard this as small enough to effec-
tively eliminate corpus size differences as poten-
tial problems.
For the purposes of interpretation we also scale
everything down so that the average redistributed
count is 1. We do this by dividing each Cyi,Coi by
N/2n, where N is the total count of all trigrams
and n is the number of trigram categories being
counted. Note that N/2n is the average count of a
given trigram in one of the groups.
sy = cy·2n/N = <...,Cyi ·2n/N,...>
so = co·2n/N = <...,Coi ·2n/N,...>
These values might just as well be submitted to the
vector comparison measure since they are just lin-
ear transformations of the redistributed C values.
The scaling expresses the trigram count as a value
with respect to the total 2n of counts involved in
the comparison, and, since summationtextni=1cyi +summationtextni=1coi =
N,summationtextni=1syi +summationtextni=1soi = 2n. As there are n sorts
of trigrams being compared in two groups, it is
clear that the average value in these last vectors
will be 1.
Similarly, this normalized value will be higher
than 1 for trigrams that are more frequent than av-
erage. Now if we sort the trigrams by frequency—
87
or more precisely, by the weight that they have
within the total R(sq) value, so by their per tri-
gram R(sq) value—we get a listing of the POS
trigrams that distinguish the groups most sharply.
This list can be made even more telling by adding
the raw frequency and a per-trigram p-value. It al-
lows us to directly see significant under and over-
use of POS trigrams, and thereby of syntax.
3.3 Between-Permutations Normalization
The purpose of this normalization is the identifica-
tion of n-gram types which are typical in the two
original sub-corpora. It is applied after comparing
all the results of all the Monte Carlo re-shufflings.
The BETWEEN-PERMUTATIONS NORMALIZA-
TION is similar to the last step of the within-
permutation normalization, except that the linear
transformation is applied across permutations, in-
stead of across groups (sub-corpora): for each
POS trigram type i in each group (sub-corpora)
g ∈{o,y}, the redistributed count Cgi is divided
by the average redistributed count for that type in
that group (across all permutations)Cgi . Note that
the average redistributed count is ci/2 for large
numbers of permutations. The values thus normal-
ized will be 1 on average across permutations.
Trigrams with large average counts between
permutations are those with high frequencies in
the original sub-corpora, and these contribute most
heavily toward statistical significance. The nor-
malization under discussion strips away the role
of frequency, allowing us to see which POS tri-
grams are most (a)typical for a group. We note
additionally that this normalization is useful only
together with information on frequency (or statis-
tical significance). Infrequent trigrams are espe-
cially likely to have high values with respect to
Cgi . For example a trigram occurring only once,
in one sub-corpus, gets the maximum value of
1/0.5 = 2 (as it is indeed very typical for this
sub-corpus), while with a count of one it clearly
cannot be statistically significant (moving between
equally sized sub-corpora with a chance of 50 %
during permutations). So it’s best to calculate
this normalization together with the per trigram p-
values.
4 A Test Case
We tested this procedure on data transcribed from
free interviews with Finnish emigrants to Aus-
tralia. The emigrants were farmers and skilled or
semi-skilled working class Finns who left Finland
in the 1960’s at the age of 25-40 years old, some
with children. Greg Watson of Joensuu Univer-
sity interviewed these people between 1995 and
1998, publishing about his corpus in ICAME 20,
1996 (Watson). He included both interviews with
those who emigrated as adults (at seventeen years
or older) and those who emigrated as children (be-
fore their seventeenth birthday). There are sixty
conversations with adult-age emigrants and thirty
with those who emigrated as children, totaling
305,000 words of relatively free conversation.
It is well established in the literature on second-
language learning that the language of people who
learned the second language as children is supe-
rior to that of adult learners. We will test our idea
about measuring syntactic differences by apply-
ing the measure to the two samples language from
adult vs. child emigrants. The issue is not remark-
able, but it allows us to verify whether the measure
is functioning.
4.1 Results
The two sub-corpora had 221,000 words for the
older group and 84,000 words for the younger
group, respectively. The sentences of the child-
hood immigrants were indeed substantially longer
(27.1 tokens) than those of the older immigrants
(16.3 tokens). So the within-permutation normal-
ization was definitely needed in this case. The
groups clearly differed in the distribution of POS
trigrams they contain (p < 0.001). This means
that the difference between the original two sub-
corpora was in the largest 0.1% of the Monte Carlo
permutations.
In addition we find genuinely deviant syntax
patterns if we inspect the trigrams most respon-
sible for the difference between the two sub-
corpora.
it ’s low tax in here
PRO COP ADJ N/COM PREP ADV
and I was professional fisherman
CONJ PRO COP ADJ N/COM
Both COP-ADJ-N/COM and N/COM-PREP-
ADV accounted for a substantial degree of aggre-
gate syntactic difference. The first pattern nor-
mally corresponds to an error, as it does in the
two (!) examples of it above (there is a sepa-
rate tag for plural and mass nouns). These are
cases where English normally requires an article.
88
Since Finnish has no articles, these are clear cases
of transfer, i.e., the (incorrect) imposition of the
first language’s structure on a second language.
The N/COM-PREP-ADV pattern (corresponding
to the use of in here) is also worth noting, as it
falls into the class of expressions which is not ab-
solutely in error (The material is in here), but it
is clearly being overused in the example above.
Presumably this is a case of hypercorrection from
Finnish, a language without prepositions. We con-
clude from this experiment that the procedure is
promising.
On the other hand there were also problems,
perhaps most seriously with the use of the tags
denoting pauses and hesitations, where we found
that the tag trigrams most responsible for the de-
viant measures in the corpora involved disfluen-
cies of one sort or another. These tended to occur
more frequently in the speech of the older emi-
grants. With the pauses removed (hesitations still
in place) a list of the ten most frequent, significant
trigrams for the older group is shown. Two ran-
dom examples from the corpus are given for each
in Table 2.
We suspect additionally that the low accuracy
rate of the tagger when applied to this material also
stems from the large number of disfluencies.
5 Conclusions and Prospects
Weinreich (1953) regretted that there was no way
to “measure or characterize the total impact one
language on another in the speech of bilinguals,”
(p. 63) and speculated that there could not be. This
paper has proposed a way of going beyond counts
of individual phenomena to a measure of aggre-
gate syntactic difference. The technique may be
implemented effectively, and its results are subject
to statistical analysis using permutation statistics.
The technique proposed follow Aarts and
Granger (1998) in using part-of-speech trigrams.
We argue that such lexical categories are likely
to reflect a great deal of syntactic structure given
the tenets of linguistic theory according to which
more abstract structure is, in general, projected
from lexical categories. We go beyond Aarts and
Granger in Showing how entire histograms of POS
trigrams may be used to characterize aggregate
syntactic distance, in particular by showing how
this can be analyzed.
We fall short of Weinreich’s goal of assaying
“total impact” in that we focus on syntax, but we
1 roadworks and uh
hill and ah
N CONJUNC INTERJEC
2 I reckon it
that take lot
PRON V PRON
3 enjoy to taking
my machine break
INTERJEC PRON V
4 but that ’s
that I clean
CONJUNC PRON V
5 I ’m uh
it ’s uh
PRON V INTERJEC
6 now what what
changing but some
CONJUNC INTERJEC PRON
7 said it ’s
all everybody has
PRON PRON V
8 bought that car
lead glass windows
V PRON N
9 that was different
I was fit
PRON V ADJ
10 Oh lake lake
uh money production
INTERJEC N N
Table 2: The most significant and most frequent
trigrams that were typical for the speech of the
group of older Finnish emigrants to Australia com-
pared to the speech of those who emigrated before
their 17th birthday. The tag trigrams indicating
pauses were removed before comparing the cor-
pora, as these appear to dominate the differences.
The examples illustrating the trigrams were cho-
sen at random, and we note that the examples of
the third sort of trigram involved tagging errors in
the first and second elements of the trigram, and
that other errors are noticeable at the seventh and
eight positions in the list (where ‘said’ and ‘glass’
are marked as pronouns). We reserve the linguistic
interpretation of the error patterns for future work,
but we note that we will also want to filter inter-
jections before drawing definite conclusions.
89
take a large step in this direction by showing how
to aggregate and test for significance, using the
sorts of counts he worked with.
The software implementing the permutation
test, including the normalizations, is avail-
able freely at http://en.logilogi.org/
HomE/WyboWiersma/FiAuImEnRe. It is de-
veloped to allow easy generalization to more than
two sub-corpora and longer n-grams.
Several further steps would be useful. We
should like to repeat the analysis here, eliminat-
ing the effect of hesitation tags, etc. Second,
we should like to experiment systematically with
the inclusion of n-grams for n > 3; to-date we
have experimented with this, but not systemati-
cally enough. Third, we would like to test the
analysis on other cases of putative syntactic dif-
ferences, and in particular in cases where tagging
accuracy might be less an issue.
Acknowledgments
We are grateful to Lisa Lena Opas-H¨anninen,
Pekka Hirvonen and Timo Lauttamus of the Uni-
versity of Oulu, who made the data available and
consulted extensively on its analysis. We also
thank audiences at the 2005 TaBu Dag, Gronin-
gen; at the Workshop Finno-Ugric Languages in
Contact with English II held in conjunction with
Methods in Dialectology XII at the Universit´e de
Moncton, Aug. 2005; the Sonderforschungsbere-
ich 441, “Linguistic Data Structures”, T¨ubingen
in Jan. 2006; and the Seminar on Methodology
and Statistics in Linguistic Research, University of
Groningen, Spring, 2006, and especially Livi Ruf-
fle, for useful comments and discussion. Finally,
two referees for the 2006 ACL/COLING work-
shop on “Linguistic Distances” also commented
usefully.

References
Jan Aarts and Sylviane Granger. 1998. Tag sequences
in learner corpora: A key to interlanguage grammar
and discourse. In Sylviane Granger, editor, Learner
English on Computer, pages 132–141. Longman,
London.
Alan Agresti. 1996. An Introduction to Categorical
Data Analysis. Wiley, New York.
Thorsten Brants. 2000. TnT — a statistical part
of speech tagger. In 6th Applied Natural Lan-
guage Processing Conference, pages 224–231, Seat-
tle. ACL.
Eugenio Coseriu. 1970. Probleme der kontrastiven
Grammatik. Schwann, D¨usseldorf.
Kees de Bot, Wander Lowie, and Marjolijn Verspoor.
2005. Second Language Acquisition: An Advanced
Resource Book. Routledge, London.
Charles Fillmore and Paul Kay. 1999. Grammati-
cal constructions and linguistic generalizations: the
what’s x doing y construction. Language, 75(1):1–
33.
Roger Garside, Geoffrey Leech, and Tony McEmery.
1997. Corpus Annotation: Linguistic Informa-
tion from Computer Text Corpora. Longman, Lon-
don/New York.
Phillip Good. 1995. Permutation Tests. Springer, New
York. 2nd, corr. ed.
1949. The Principles of the International Phonetic As-
sociation. International Phonetics Association, Lon-
don, 1949.
Brett Kessler. 2001. The Significance of Word Lists.
CSLI Press, Stanford.
Chris Manning and Hinrich Sch¨utze. 1999. Foun-
dations of Statistical Natural Language Processing.
MIT Press, Cambridge.
Shana Poplack and David Sankoff. 1984. Borrowing:
the synchrony of integration. Linguistics, 22:99–
135.
Shana Poplack, David Sankoff, and Christopher Miller.
1988. The social correlates and linguistic processes
of lexical borrowing and assimilation. Linguistics,
26:47–104.
William C. Ritchie and Tej K. Bhatia, editors. 1998.
Handbook of Child Language Acquisition. Aca-
demic, San Diego.
Peter Sells. 1982. Lectures on Contemporary Syntactic
Theories. CSLI, Stanford.
Sarah Thomason and Terrence Kaufmann. 1988. Lan-
guage Contact, Creolization, and Genetic Linguis-
tics. University of California Press, Berkeley.
Frans van Coetsem. 1988. Loan Phonology and the
Two Transfer Types in Language Contact. Publica-
tions in Language Sciences. Foris Publications, Dor-
drecht.
Greg Watson. 1996. The Finnish-Australian English
corpus. ICAME Journal: Computers in English Lin-
guistics, 20:41–70.
Uriel Weinreich. 1953. Languages in Contact. Mou-
ton, The Hague. (page numbers from 2nd ed. 1968).
