and satisfactory definition exists, and some linguists 
deny any validity of the word, relegating it to folk 
linguistics. 
Following Greenberg we take words as being composed of morphemes 
so that a word may be identified with a sequence of morphemes and no 
morpheme overlaps two words. From the distribution of the morphemes 
of a corpus we find clusters which approximate the words of the corpus. 
The approximating units are determined relative to the corpus from which 
the distribution is defined. The corpus may be either considered as a 
closed sublanguage in itself or as a sample from some larger corpus. 
We study the behavior of approximate units relative to longer and longer 
portions of the corpus, and also relative to the corpus considered as a 
statistical sample. 
Assuming that a word may be represented as a sequence of morphemes, 
how should this sequence be distinguished? In the well-known paper of 
Togeby (1949) there is a convenient summary of structural views of the 
word. In his discussion, the word is set forth as a morpheme sequence 
possessing properties classified under the headings of 1° Forme libre 
minimum, 2° Séparabilité, and 3° Permutabilité. In considering how a 
morpheme sequence should be distinguished as a word we will begin by 
examining Togeby's classifications. 
In Togeby, under the discussion of a word as a forme libre minimum, 
reference is made to Bloomfield's (1933) statement about the word as a 
minimum free form and the smallest items which are spoken by themselves, 
in isolation. 
The idea of minimum free form is actually found somewhat earlier 
in Bloomfield's (1926) Postulates. 
    A minimum free form is a word. A word is thus a form 
    which may be uttered alone (with meaning) but cannot be analyzed 
    into parts that may (all of them) be uttered alone (with meaning). 
    Thus the word quickly can be analyzed into quick and -ly, but the 
    latter part cannot be uttered alone; the word writer can be 
    analyzed into write and -er, but the latter cannot be uttered 
    alone (the word err being by virtue of different meaning a 
    different form) ... 
Similar views are found in the older "universal grammars." They 
differ principally in taking the Aristotelian position that the word 
and not some smaller unit has meaning. For example, in Harris (1771) 
we find a concern with minimum units of meaning. 
    But what shall we say? Have these parts (of a Quantity 
    of sound) again other parts, which are in like manner 
    significant and may be pursued to Infinite? Can we suppose 
    that all Meaning, like Body, to be divisible, and to include 
    within itself other Meanings without end? If this be absurd, 
    then must we necessarily admit, that there is such a thing as 
    a Sound significant, of which no part is of itself significant. 
    And this is what we call the proper character of a Word. For 
    thus though the Words (Sun) and (shineth) have each a Meaning, 
    yet is there certainly no Meaning in any of their Parts, 
    neither in the syllables of one, nor the Letters of the other. 
James Harris refers to Priscian's definition in which the word is defined 
as a minimum meaningful utterance in connected speech. 
    "Dictio est pars minima orationis constructae, id 
    est, in ordine compositae. Pars autem, quantum ad 
    totum intelligendum, id est, ad totius sensus 
    intellectum. Hoc autem ideo dictum est, nequis 
    conetur vires in duas partes dividere, hoc est, in 
    vi et res; non enim ad totum intelligendum haec fit ..." 
For purposes of constructing our model we shall interpret minimum 
free form as follows: 
A word S is a sequence of subword units. If this sequence may be 
uttered alone, then it is to be expected that the sequence co-occurs 
freely with other sequences. 
Under the classification of séparabilité, Togeby places the 
requirement of Jakobson (1938) that words are the separable components 
of phrases: "minimal actually separable components of the phrase." 
Conversely, the constituents of a word should not be separable. 
The general requirement of séparabilité seems to be that a word 
is a morpheme sequence which may co-occur with other morpheme sequences 
to give grammatical utterances. If the sequence is a distinct word, then 
its morphemes must be contiguous, and the morphemes of a noncontiguous 
grammatical sequence cannot be identified with the same word. 
Under permutabilité, Togeby quotes Hjelmslev (1943): "les mots pourront 
tout simplement être définis comme les signes minima dont l'expression, 
et de même le contenu, sont réciproquement permutables." According to 
Togeby, Hjelmslev means that "un changement de l'ordre des mots pourra 
entraîner un changement de sens, tandis qu'un changement de l'ordre des 
parties du mot n'en sera pas capable." 
The requirement here is that if a sequence of morphemes is identified 
with a word, then the order of the sequence must be invariant. 
Greenberg (1957) proposes a definition of the word based on 
substitution and the recognition of grammatical sequences, which we 
interpret as follows: 
Let S be a sequence of linguistic units and G the class of grammatical 
sequences, in Greenberg's words the class of sequences which "exist as 
expressions in the language." 
Suppose that S = X A B C D E ∈ G is a morpheme sequence. We want to 
decide whether or not the boundary between B and C is a word boundary. 
To each morpheme of S there corresponds a "nucleus." For the nucleus 
of B to be a word terminal it is necessary that "infinite insertion" of 
nuclei is possible between B and C; otherwise, if there "is a maximum to 
the number of nuclei that can be inserted," the boundary is an "intra-word 
boundary." 
Nuclei are classes of morpheme sequences having strongly equivalent 
substitution properties. Some of the conditions for class membership 
are so strict that we would expect the defined classes to be empty for 
the language taken as a whole. Perhaps, as Chomsky (1958) conjectures in a 
review of Greenberg's essay, "It might be that the notion of word may be 
defined relative to a particularly simple set of sentences." 
In practice, Greenberg's conditions might be interpreted as follows: 
S = X A B C D E occurs in the language. The subsequence BC may belong to 
a single word if it is replaceable by a single morpheme and grammaticality 
is preserved. If for a small number of morpheme sequences $S_i$, the sequences 
X A B $S_i$ C D E are grammatical, then the subsequence BC belongs to the 
same word. If the sequences X A B $S_i$ C D E are grammatical for a large 
number of $S_i$, then the subsequence BC probably does not belong to the 
same word. 
In an unpublished manuscript, Juilland develops a constructive definition 
of the word which requires the recognition of grammaticality. If 
S = X A B C D E ∈ G is a morpheme sequence, the boundary between B and C 
is classified according to the potential sentence occurrences of B and C. 
Boundaries are classified as "conjunctive" or "disjunctive." Disjunctive 
boundaries isolate potential words called "functional units." Conjunctive 
boundaries occur potentially within words but must be tested by an 
"insertion criterion." Thus if BC spans a conjunctive boundary, 
then B is a word boundary if there exists a morpheme sequence $S_i$ such 
that X A B $S_i$ C D E ∈ G. 
The Use of Numerical Linguistic Data 
Our object now is to define a quantitative procedure for 
approximating words. The procedure attempts to meet the various 
requirements summarized in the last section. Since our interest is 
in distributional methods, we do not want the procedure to include an 
independent test for grammaticality. 
The requirements that we attempt to fulfill are summarized by 
Juilland as adhesion and separability. These are realized as a common 
characteristic in the procedures of Greenberg and Juilland: A potential 
word is isolated as a sequence of morphemes which are associated in 
some special way, then the potential word is tested for its function 
as a word, according to some test of insertion. 
Let us imagine a linguist confronted by the following data. 
Frequency refers to text frequency. Let X A B C D E be a sequence of 
morphemes to be segmented. Consider the boundary between B and C. Is 
this boundary a word boundary? Assume first that B occurs only with 
A, E, C and F, as indicated in Case 1. 

Morpheme Pair   Frequency     Morpheme Pair   Frequency 
AB              4             BC              3 
EB              6             BF              7 

Case 1 

With no further information, we might observe that B occurs more 
frequently with A than with C, and segment as AB CD. Under this 
condition the requirement of adhesion may be met, but a simple consid- 
eration of frequencies is not sufficient to meet the requirement of 
separability. This is illustrated by the hypothetical set of data of 
Case 2. 

Morpheme Pair   Frequency     Morpheme Pair   Frequency 
AB                            BC              3 
EB              1             BF              7 
GB              1 
HB              1 
IB              1 
JB              1 
KB              1 

Case 2 

In this case the frequency of AB also exceeds the frequency of 
BC, but the segmentation AB CD would not agree with linguistic 
intuition at all. In Case 2, B has much greater freedom of combination 
on the left than on the right, and to satisfy the condition of 
separability, at least approximately, we would segment as A BCD. 
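
The contrast between text frequency and freedom of combination in Case 2 can be sketched in a few lines. This is only an illustration: the function name `freedom` is ours, and the AB frequency, which the table does not supply, is set to 4 purely as an assumption (any value above the BC frequency leads to the same counts).

```python
# Pair frequencies of the hypothetical Case 2.
# ASSUMPTION: the AB count is not given in the table; 4 is a placeholder.
pair_freq = {
    ("A", "B"): 4, ("E", "B"): 1, ("G", "B"): 1, ("H", "B"): 1,
    ("I", "B"): 1, ("J", "B"): 1, ("K", "B"): 1,
    ("B", "C"): 3, ("B", "F"): 7,
}

def freedom(pair_freq, unit):
    """Return (L, R): the numbers of distinct predecessors and successors of unit."""
    left = {a for (a, b) in pair_freq if b == unit}
    right = {b for (a, b) in pair_freq if a == unit}
    return len(left), len(right)

L_B, R_B = freedom(pair_freq, "B")
print(L_B, R_B)  # 7 2
```

Counting types rather than tokens gives B seven distinct predecessors against two successors, which is the asymmetry that motivates the segmentation A BCD.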
In formalizing these intuitions, we refer to the procedure of 
Harris (1955) for grouping phonemes into morphs. Harris assumes that 
an utterance U may be represented as a sequence of phonemes $a_1 a_2 \ldots a_n$. 
Let $R(a_1)$ be the number of different phonemes which may follow the 
phoneme $a_1$ in the total language. Similarly, let $R(a_1 a_2)$ be the number 
of different phonemes which may follow $a_1 a_2$, and so on. Likewise, let 
$L(a_n)$ be the number of different phonemes which may precede $a_n$, 
$L(a_{n-1} a_n)$ the number which may precede $a_{n-1} a_n$, and so on. Then the 
sequence 

$$S_R = R(a_1)\; R(a_1 a_2)\; R(a_1 a_2 a_3)\; \ldots\; R(a_1 a_2 \ldots a_n)$$

describes the freedom of co-occurrence on the right at each phoneme of 
U, and the sequence 

$$S_L = L(a_1 a_2 \ldots a_n)\; L(a_2 \ldots a_n)\; \ldots\; L(a_n)$$

describes the freedom of co-occurrence on the left at each phoneme of U. 
Harris observes that morpheme boundaries tend to occur at positions 
in U where the corresponding values of R and L are large or attain their 
relative maxima. Thus if $R(a_1 \ldots a_k)$ is a relative maximum in the 
sequence $S_R$, then $a_k$ is a morpheme terminal. Likewise $a_k$ is a 
morpheme terminal if $R(a_1 \ldots a_k)$ exceeds a value comparable to the 
total number of different phonemes in the language. Under similar 
conditions for $L(a_j \ldots a_n)$, $a_j$ is a morpheme initial. 
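
The successor counts just described may be sketched as follows. The sketch makes assumptions that are ours, not Harris's: a small list of utterances stands in for "the total language", letters stand in for phonemes, and the function names are hypothetical. A relative maximum is taken to be a value strictly greater than both neighbors.

```python
def successor_counts(utterance, corpus):
    """R(a1..ak) for k = 1..n: distinct symbols that follow each prefix in corpus."""
    counts = []
    for k in range(1, len(utterance) + 1):
        prefix = utterance[:k]
        followers = {u[k] for u in corpus if len(u) > k and u[:k] == prefix}
        counts.append(len(followers))
    return counts

def relative_maxima(values):
    """Indices whose value is strictly greater than both neighbors."""
    padded = [-1] + list(values) + [-1]
    return [i for i in range(len(values))
            if padded[i + 1] > padded[i] and padded[i + 1] > padded[i + 2]]

# Toy corpus (hypothetical), with letters in place of phonemes.
corpus = ["hisbook", "hiscar", "hisdog", "himself", "hatbox"]
s_r = successor_counts("hisbook", corpus)
print(s_r)                   # [2, 2, 3, 1, 1, 1, 0]
print(relative_maxima(s_r))  # [2]
```

The single peak falls on the third symbol, so a terminal is placed after "his". The left-hand counts follow by applying the same functions to reversed strings.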
Applied to sequences of morphemes with uncontrolled diversity, 
Harris's procedure becomes particularly unwieldy. We suggest that we 
might achieve the same results as Harris by using fixed-length subsequences 
rather than some higher-level syntactic unit. Thus for some fixed k, the 
co-occurrence measures 

$$R_k(a_1 \ldots a_k)\; R_k(a_2 \ldots a_{k+1})\; \ldots\; R_k(a_{n-k+1} \ldots a_n)$$

might yield the same segments as the sequence 

$$R(a_1)\; R(a_1 a_2)\; \ldots\; R(a_1 \ldots a_n).$$

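
A minimal sketch of the fixed-length measure, under the assumption that the corpus is a list of morpheme sequences and that each window of k units is credited with the number of distinct units observed after it anywhere in the corpus. The function names and the toy corpus are hypothetical.

```python
from collections import defaultdict

def kgram_successors(corpus, k):
    """Map each k-gram to the set of units observed immediately after it."""
    followers = defaultdict(set)
    for seq in corpus:
        for i in range(len(seq) - k):
            followers[tuple(seq[i:i + k])].add(seq[i + k])
    return followers

def rk_profile(seq, corpus, k):
    """R_k for each window a_i ... a_{i+k-1} of seq."""
    followers = kgram_successors(corpus, k)
    return [len(followers.get(tuple(seq[i:i + k]), set()))
            for i in range(len(seq) - k + 1)]

corpus = [["come", "and", "ride"], ["come", "and", "jump"],
          ["come", "Boots"], ["jump", "in"]]
print(rk_profile(["come", "and", "ride"], corpus, 1))  # [2, 2, 0]
print(rk_profile(["come", "and", "ride"], corpus, 2))  # [2, 0]
```

Because the window length is fixed, the table of followers is built once per corpus and reused for every sequence, which is what makes the fixed-length variant less unwieldy than growing prefixes.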
A Segmentation Procedure 
The placing of segment boundaries at positions of maximum freedom 
of combination realizes separability, but the requirement that a word 
should be a morpheme sequence showing strong internal association is 
accounted for only in a negative way--we do not place boundaries at 
positions of low freedom of combination. We propose another procedure 
for grouping morphemes by combining both left and right freedom of 
co-occurrence. As a result we derive a scale of degrees of distributional 
separation. 
In Harris's procedure there is sufficient information to form a 
ranking of boundaries. If $a_1 \ldots a_n$ is the sequence to be segmented then 
we place a boundary between $a_k$ and $a_{k+1}$ if one or more of the following 
conditions is met. 
1. $R(a_1 \ldots a_k)$ is a relative maximum in $S_R$. 
2. $L(a_{k+1} \ldots a_n)$ is a relative maximum in $S_L$. 
3. R or L is large in comparison with the number of different 
phonemes. 
If any two of these conditions are satisfied, we have stronger 
distributional evidence for segmentation than in the case of just one 
alone. Likewise, if all three conditions are fulfilled, then we would 
expect that $a_k$ would be a morpheme terminal more often than if just two 
of the conditions are fulfilled. We shall adopt a similar line of 
reasoning to segmentations based on the distributions of fixed-length 
sequences. 
For convenience we introduce some notation. Let 
A B ) C D indicate a right-hand boundary after B, 
following from the distribution of B, 
and 
A B ( C D indicate a left-hand boundary before C, 
following from the distribution of C. 
In a "first-order segmentation" of the sequence X A B C D E, we will 
use only the distributional properties of single morphemes. Thus, in 
our hypothetical Case 2, we refer only to the distributional properties 
of B. 

Morpheme Pair   Frequency     Morpheme Pair   Frequency 
AB                            BC              3 
EB              1             BF              7 
GB              1 
HB              1 
IB              1 
JB              1 
KB              1 

In this case the text frequencies indicate that B has much greater 
freedom of combination on the left than on the right. Given no further 
information, we segment as A ( B C D. We formalize this decision in the 
following "Cutting Rule." 
If R(B) > L(B), cut as X A B ) C D E. 
If R(B) < L(B), cut as X A ( B C D E. 
If R(B) = L(B), cut either as X A B ) C D E or as 
X A ( B C D E. 
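
Read as a comparison of R(B) with L(B) (a right-hand cut when R(B) is the larger, a left-hand cut when it is the smaller, as the Case 2 decision requires), the Cutting Rule reduces to a three-way test. The function name and return symbols below are our own choices for illustration.

```python
def cut(r_b, l_b):
    """Cutting Rule for a single morpheme B with right freedom r_b and left freedom l_b."""
    if r_b > l_b:
        return ")"       # right-hand boundary after B:  X A B ) C D E
    if r_b < l_b:
        return "("       # left-hand boundary before B:  X A ( B C D E
    return "either"      # R(B) = L(B): either cut is allowed

# Case 2: B has 2 distinct successors and 7 distinct predecessors.
print(cut(2, 7))  # (
```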
Let us insert right- or left-hand boundaries at C by use of the 
cutting rule, as we did with B. The strongest evidence for segmentation 
(separability) is in the case where R(C) < L(C), so that we place a 
left-hand boundary before C; and at the same time R(B) > L(B), so that 
we place a right-hand boundary after B. The result is indicated as 
A B ) ( C D. The weakest evidence for segmentation (adhesion) is 
where R(B) < L(B), and at the same time R(C) > L(C). The result is 
indicated as A ( B C ) D. 
There are nine possible combinations according to the distributional 
properties of B and C. These are shown in Figure 1 , which we refer to 
as a "Segmentation Rule." The number of slashes--the "degree" of the 
boundary--indicates the relative evidence for segmentation. 

                          R(C) - L(C) 
R(B) - L(B)        > 0        = 0        < 0 
   > 0             B//C       B///C      B////C 
   = 0             B/C        B//C       B///C 
   < 0             B C        B/C        B//C 

Figure 1. Segmentation Rule 
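
The nine cells of the Segmentation Rule can be summarized by one expression, degree = 2 + sign(R(B) - L(B)) - sign(R(C) - L(C)), which runs from 0 (B C) to 4 (B////C). The sketch below assumes that reading of the figure; the function names are ours, the counts for come and and are the R/L values quoted later in the text, and the counts for ride are hypothetical.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def boundary_degree(r_b, l_b, r_c, l_c):
    """Degree (number of slashes) of the boundary between adjacent units B and C."""
    return 2 + sign(r_b - l_b) - sign(r_c - l_c)

def segment(morphemes, R, L):
    """Write a morpheme sequence with slashes marking the degree of each boundary."""
    pieces = [morphemes[0]]
    for b, c in zip(morphemes, morphemes[1:]):
        degree = boundary_degree(R[b], L[b], R[c], L[c])
        pieces.append("/" * degree if degree else " ")
        pieces.append(c)
    return "".join(pieces)

R = {"come": 19, "and": 21, "ride": 9}   # ride: hypothetical count
L = {"come": 32, "and": 19, "ride": 5}   # ride: hypothetical count
print(segment(["come", "and", "ride"], R, L))  # come and//ride
```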
The first sample which we will consider for purposes of illustration 
is from the primer Ted and Sally. This text contains 121 different printers' 
words in all. As in other deliberately morphemically closed 
texts, Zipf's law does not operate, so we have a large variety of 
contextual combinations with many repetitions. The sample consists of 
the first 4,670 morphemes and forms the main narrative. We obtain the 
segments: 
come//Boots//sai d//Ted//// 
come and//ride//// 
come and//ride///in///my wagon//// 
jump/in///Boots//sai d//Ted//// 
ride///in///my//wagon//Boots// 
jump/in/and//ride//// 
here//we//go//sai d//Ted 
The foregoing segmentation is first order in that inference is made 
using only the distribution properties of single morphemes. The procedure 
may be extended to consider n-tuples of units for "n-th order" rules. 
However, rules using extended context have two difficulties. One is the 
simple difficulty of finding enough context in a short text. A second, 
more interesting restriction is that certain boundaries may not follow 
each other, depending on the order of the segmentation. For example, two 
zero-degree boundaries may not follow each other under a rule of any order. 
The simple type counts, as measures of freedom of co-occurrence, may 
be replaced by other more general measures, for example the entropy E of 
the type-frequency distribution. See, for example, Khinchin (1957). 
Entropy has the desirable property that it may be used to estimate the 
average number of morphemes that may co-occur with a given unit. For 
example, if the unit U has entropy E(U) of successors, then the "diversity" 
of successors is $2^{E(U)}$. The entropy would be the same if all the $2^{E(U)}$ 
successors were equally likely. 
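
As a worked illustration of entropy and diversity, the sketch below computes the successor entropy of one unit in bits (base 2, so that the diversity is 2 raised to the entropy). The token sequence is a made-up example and the function names are ours.

```python
import math
from collections import Counter

def successor_entropy(tokens, unit):
    """Entropy in bits of the distribution of units that immediately follow unit."""
    counts = Counter(b for a, b in zip(tokens, tokens[1:]) if a == unit)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def diversity(entropy_bits):
    """Equivalent number of equally likely successors."""
    return 2 ** entropy_bits

tokens = "come and ride come and jump come Boots come in".split()
e = successor_entropy(tokens, "come")
print(e)             # 1.5
print(diversity(e))  # about 2.83
```

Here come is followed by and twice and by Boots and in once each, so three observed successor types behave like roughly 2.8 equally likely ones.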
Evaluation Procedures 
Applied to real data, the constructive procedures of Greenberg 
and Juilland are developed with the aid of many illustrative examples, 
but are still programmatic and have not been applied to large linguistic 
samples. Likewise, Harris gives the morphemic segmentation of many 
sentences but does not give a numerical evaluation of his results for a 
large text. 
In evaluating our approximation procedures, we will be concerned 
with degrees of adequacy. The results presented so far suggest that 
there is a strong correspondence between the degree of a segment boundary 
and the corresponding syntactic boundary. It appears that segment 
boundaries of zero and first degrees correspond to intra-word boundaries, 
second-degree segment boundaries to word boundaries, and third- and 
fourth-degree boundaries to phrase and sentence boundaries. 
To determine the correspondence, we give a more precise formulation. 
In the morpheme sequence X A B C D E let $B_1$ and $C_1$ be the lowest level 
constituents containing B and C respectively. It may happen that $B_1$ = B 
and $C_1$ = C. If $B_1$ and $C_1$ belong to the same printers' word, then the 
syntactic boundary between B and C is a morpheme boundary. If $B_1$ and $C_1$ 
do not belong to the same printers' word, then the syntactic boundary 
between B and C is labeled according to the highest syntactic level of 
$B_1$ or $C_1$. 
Thus in the sequence un gentlemanly the space marks a morpheme 
boundary, since ungentlemanly is a printers' word. However, in the 
King of England's, where B = England and C = 's, $B_1$ = the King of England 
and $C_1$ = 's. Consequently we take the boundary between England and 
's as a phrase boundary. In the two-word sequences the man and he 
went, the spaces mark word and phrase boundaries respectively. 
Between any two morphemes we have 20 possible combinations of 
syntactic and segment boundaries. The correspondence may be evaluated 
by the $\chi^2$ statistic, or derived statistics such as the contingency 
coefficient $C = \sqrt{\chi^2/(N + \chi^2)}$. See, for example, Kendall (1952). 
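
The evaluation can be sketched for any table of counts whose rows are syntactic boundary types and whose columns are segment degrees. The sketch assumes the usual Pearson forms, chi-square as the sum of (observed - expected)^2 / expected and C as the square root of chi-square / (N + chi-square); the function names and the 2 x 2 table are hypothetical.

```python
import math

def chi_square(table):
    """Return (chi2, N) for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip cells in empty rows or columns
                chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def contingency_coefficient(chi2, n):
    """Pearson's contingency coefficient."""
    return math.sqrt(chi2 / (n + chi2))

# Hypothetical table with perfect agreement between two boundary types
# and two segment degrees.
chi2, n = chi_square([[10, 0], [0, 10]])
print(chi2, n)                           # 20.0 20
print(contingency_coefficient(chi2, n))  # about 0.707
```

Even perfect agreement does not bring C to 1; its upper bound depends on the dimensions of the table, which matters when comparing values across tables of different shape.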
Some Distributional Groupings 
We examine the correspondence between syntactic and segment 
boundaries using several samples of morphemic data. 
In many cases a zero-degree pair occurs in a manner which is only 
barely statistically significant. Let us compare 
look Sally 
R/L ~3/1~ 39/33 
Sign (R-L) - + 
and 
come and 
R/L 19/32 21/19 
Sign (R-L) - + 
For the sequence look Sally, the differences (R-L) appear to be 
statistically significant, but in come and, we may wonder whether the 
slight positive value of R(and) - L(and) is due to sampling variations. 
In a statistical version of our procedure, we test the hypothesis 
that R(and) > L(and). Since there is no exact sampling theory for this 
test, we construct an approximate test. The 4646-morpheme text is 
divided into approximately equal blocks, and R-L computed for each block 
separately. The values of R-L may be viewed as independent samples, 
provided the individual block size is large enough. We infer from the 
signs of R-L in each block that R(come) < L(come). But we may not infer 
that R(and) > L(and), since the positive difference occurs in only one 
trial in five. On the other hand, for the pair look Sally, R(look) < 
L(look) and R(Sally) > L(Sally) in all five blocks. 
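
A minimal sketch of the block test follows. It assumes that R and L are counts of distinct successors and predecessors within each block, and that an inequality is inferred only when every block shows the same sign; the text does not state the decision rule in that form, so the unanimity criterion and the function names are our assumptions.

```python
def block_signs(tokens, unit, n_blocks=5):
    """Sign of R(unit) - L(unit) in each of n_blocks equal consecutive blocks.

    Tokens left over after the last full block are ignored.
    """
    size = len(tokens) // n_blocks
    signs = []
    for b in range(n_blocks):
        block = tokens[b * size:(b + 1) * size]
        right = {y for x, y in zip(block, block[1:]) if x == unit}
        left = {x for x, y in zip(block, block[1:]) if y == unit}
        d = len(right) - len(left)
        signs.append((d > 0) - (d < 0))
    return signs

def inferred_sign(signs):
    """Return the common sign if all blocks agree, otherwise 0 (no inference)."""
    return signs[0] if len(set(signs)) == 1 else 0

# Toy text (hypothetical): x has one successor type and two predecessor types.
tokens = ["a", "x", "p", "b", "x", "p"] * 2
signs = block_signs(tokens, "x", n_blocks=2)
print(signs, inferred_sign(signs))  # [-1, -1] -1
```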
Considering the 4646-morpheme text as a statistical sample, the 
inferred zero-degree segments are sai d, look Ted, look Sally, and 
run Ted. If they occurred, run Sally, look Ted, say Ted and say Sally 
would also have zero degree, while come run and come look would be of 
second degree. 
The next sample is from a lower school reader. The corpus is 
the first 2100 morphemes from a simplified version of Robinson Crusoe. 
Even though this text is simplified, it is fairly representative of 
ordinary language and the frequency distribution follows Zipf's law. 
The words are morphemically simple, but many morphemes occur only once. 
For the first sentence, the morphemic representation and the groupings 
relative to samples of the first 300, 600, ..., 2100 morphemes follow. 
The ship be ing fit ed out I go ed on board the one st of 
September 1659. 

Theship be ingfittedout Iwent onboard thefirstofSeptember1659 
Theship beingfittedout I went onboard thefir st of September1659 
Theship beingfittedout I went onboard the first of September1659 
The ship beingfittedout I went onboard the first of September1659 
... 
The ship beingfittedout I went onboard the first of September1659 

As soon as the sample reaches 1200 morphemes, the segmentation 
becomes stable. In this first sentence ing fitt ed and of September 1659 
remain unsegmented since fit, September, and 1659 occur only once each 
and we lack distributional information. The pairs being, fitted, 
went, and first are coextensive with printers' words. On board shows 
strong association and is operationally a word. The morpheme the shows 
strong disassociation in the context the////first, but neutral association 
in the context the//ship. 
Several high-frequency morphemes tend to occur early in the text 
so that we have fairly extensive distributional information for the first 
sentence, but less information for morphemes occurring later in the text. 
In this sample there are 124 different morphemes. Of these 215 occur 
only once and 82 only twice, so we have little information for segmentation. 
On the other hand, the high-frequency morphemes the, ship, be, ing, out, 
... all occur in the first sentence. A consequence is the poor performance 
of the procedure when applied to more than the first two sentences. See 
table 
A final example is Quine's Word and Object. We show the segmentation 
of a sentence relative to a sample of 900 morphemes. Even though the 
words tend to be polymorphic, the morphemic diversity is smaller than 
that found for the first 900 morphemes of Robinson Crusoe. The values 
are 4.0 and 5.1, respectively. It follows that morphemic combination 
in Word and Object is more restrained and the occurrence of longer words 
does not imply more freedom of morphemic combination. 
The segmentation follows. 
For // the / case / of //// sent // ence s //// general ly //// 
how ever // or //// even // the / case / of // e tern al // 
sent // ence s //// general ly sure ly //// there // i s /// 
no / thing /// approach ing a //// fixed //// stand ard // 
of //// how / far /// in // direct //// quotation /// may /// 
de viate /// from // the // di rect. 
The morpheme groupings are: 
For thecaseof sent ences generally however or even 
thecaseof eternal sent ences generallysurely there is 
no thing approachinga fixed standard of howfar in 
direct quotation may deviate from the direct. 
Some numerical results are summarized in Table 1. The measures 
of correspondence are between word boundaries and segment boundaries 
of degrees two, three, or four. In Table 1, Length refers to the text 
length in morphemes, and N is the number of boundaries for which the 
correspondence measures were computed. 

Text              Length   Rule           N     χ²      C     Diversity 
Ted and Sally     4646     Second Order   197   104.4   .59   2.8 
Robinson Crusoe   2100     First Order    95    5.0     .07   7.0 
Word and Object   900      First Order    95    35.8    .85   4.1 

Table 1. Summary of word and segment correspondences. 

The general conclusion is that words do correspond to segments of 
at least second degree in a statistically significant manner. The 
correspondence, however, is dependent on text length and style. 
Left-Right Linguistic Asymmetry 
In applying Harris's procedure to our test data, we observe that 
the segments obtained from the R's alone were different from the segments 
obtained from the L's alone. 
Using entropy as a measure of freedom of co-occurrence, and segmenting 
after each maximum in $E_R$, we obtain the first-order segments: 
come Boots / said Ted / 
come and ride / 
come / and ride / in my wagon / 
jump in Boots / said Ted / 
Placing a boundary before every maximum in $E_L$, we obtain the segments: 
come / Boots said Ted / 
come / and ride / 
come and ride in / my wagon 
jump in / Boots sai d / Ted 
Combining $E_R$ and $E_L$, we obtain the segments: 
come // Boots // said // Ted //// 
come and // ride //// 
come and // ride //// in my wagon //// 
jump in // Boots sai d Ted //// 
Notice that the segments following from the $E_R$'s alone are in better 
agreement with conventional syntactic units than those following from the 
$E_L$'s alone. Using just the $E_L$'s we obtain Boots said Ted, come and 
ride in, and Boots said as segments which are not easily identifiable as 
phrases. 
Notice also that fourth-degree boundaries coincide more often with 
those following from the $E_R$'s than those following from the $E_L$'s. This 
suggests that there is more information for segmentation in following 
units as compared to preceding units. 
If we examine the phonemic examples in Harris's paper, e.g. 
~ s a y 1 o w w o h 1 z w ~ r ~ p 
R 5 29 15 15 28 7 5 29 7 l 8 29 29 7 2 29 9 29 
L 24 3 23 lO 2 27 5 3 23 16 1 8 18 23 2/~ 5 23 3_I 
The silo walls were up 
or 
      i   t   k   ə   n   t   e   y   n   z   ə   l   u   w   m   i   n   ə   m 
R    10  28  11  11  27   7   6   6   3  28  21   9   2   9  28   4  10   2  28 
L    22  19  21   1   1   7   7   3   7  16  22   1   1   1   2   1   5  13   9 
It contains aluminum. 
we find that the range of following phonemes is larger than that of the 
preceding. In It contains aluminum, for example, the range of successors 
is 28-2 = 26 and that of predecessors is 22-1 = 21. Moreover, the R's 
and L's give different segments. From the R's we obtain 
it / kən / teynz / əluwm / in / əm 
From the L's alone we obtain 
it / kən / teynz / əluwmin / əm 
Another example of different segmentation resulting from following 
and preceding units is found in Gammon (1963). In this study the 
linguistic units were Fries' classes, and the sample a text of 5000 words. 
The second-order segments from the following classes are 
If one believes/ that all questions raised/by science/... 
The reverse segmentation gives: 
If/one believes that all/questions raised by/science 
In this text, the variance of $E_R$ is larger than that of $E_L$. 
A related result is Johnson's (1965) experiment which relates 
constituent structure to memory blocks. Carried out in reverse order, 
where Ss are expected to remember preceding words, constituents are not 
so well isolated. 
In our primer data, following morphemes are more variable than 
preceding morphemes. Using entropy as a measure of diversity, 
E($E_R$) = E($E_L$) = 3.18, 
where E indicates expected value. It may be shown that the expected 
value of right and left entropies must be equal. But for the variances 
we find 
Var($E_R$) = 2.33 and Var($E_L$) = 1.98. 
The difference Var($E_R$) - Var($E_L$) is significant for this sample. 
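
The equality of the two expected values can be made explicit under assumptions that are ours rather than stated in the text: the expectation is taken over tokens, (U, V) denotes a pair of adjacent units, and a unit has the same relative frequency p as first and as second member of a pair (as in a long or circular text). Then the expected right and left entropies are the two conditional entropies of the pair:

```latex
\begin{align*}
E(E_R) &= \sum_{u} p(u)\, H(V \mid U = u) = H(U, V) - H(U), \\
E(E_L) &= \sum_{v} p(v)\, H(U \mid V = v) = H(U, V) - H(V), \\
H(U) &= H(V) \;\Longrightarrow\; E(E_R) = E(E_L).
\end{align*}
```

Only the means are constrained in this way; nothing forces the variances of the right and left entropies to agree, which is why the variances can carry the directional asymmetry.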
For the application of our segmentation rules it is of interest that 
$E_R$ - $E_L$ is more closely correlated with $E_R$ than it is with $E_L$. And, in 
fact, in all the English samples that we have considered, Var($E_R$) > Var($E_L$). 
Moreover, in these samples Cor(|$E_R$ - $E_L$|, $E_R$) > Cor(|$E_R$ - $E_L$|, $E_L$). The 
variances and correlations are shown in Table 2. 

Text              Length   Var(E_R)   Var(E_L)   Cor(|E_R-E_L|, E_R)   Cor(|E_R-E_L|, E_L) 
Ted and Sally     4646     2.33       1.98       .61                   .51 
Robinson Crusoe   2100     3.63       3.46       .32                   .24 
Word and Object   900      2.19       1.99       .37                   .19 

Table 2. Variances and Correlations. 
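
Quantities of the kind reported in Table 2 can be computed along the following lines. The sketch rests on assumptions of ours: entropies are in bits, every statistic is weighted by token frequency, and the text is treated as circular so that each unit occurs equally often as first and as second member of a pair. Whether the published figures were weighted in this way is not stated. All function names and the toy text are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(counter):
    """Entropy in bits of the distribution given by a Counter of frequencies."""
    total = sum(counter.values())
    return -sum(c / total * math.log2(c / total) for c in counter.values())

def directional_entropies(tokens):
    """Return {unit: (E_R, E_L, frequency)}, treating the token list as circular."""
    succ, pred = defaultdict(Counter), defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:] + tokens[:1]):
        succ[a][b] += 1
        pred[b][a] += 1
    freq = Counter(tokens)
    return {u: (entropy(succ[u]), entropy(pred[u]), freq[u]) for u in freq}

def weighted_mean(xs, ws):
    return sum(x * w for x, w in zip(xs, ws)) / sum(ws)

def weighted_cov(xs, ys, ws):
    mx, my = weighted_mean(xs, ws), weighted_mean(ys, ws)
    return sum(w * (x - mx) * (y - my) for x, y, w in zip(xs, ys, ws)) / sum(ws)

def summary(tokens):
    """Token-weighted means, variances and correlations of E_R and E_L."""
    table = directional_entropies(tokens)
    er = [v[0] for v in table.values()]
    el = [v[1] for v in table.values()]
    ws = [v[2] for v in table.values()]
    gap = [abs(r - l) for r, l in zip(er, el)]
    var_r, var_l = weighted_cov(er, er, ws), weighted_cov(el, el, ws)
    var_gap = weighted_cov(gap, gap, ws)
    return {
        "mean_R": weighted_mean(er, ws),
        "mean_L": weighted_mean(el, ws),
        "var_R": var_r,
        "var_L": var_l,
        "cor_gap_R": weighted_cov(gap, er, ws) / math.sqrt(var_gap * var_r),
        "cor_gap_L": weighted_cov(gap, el, ws) / math.sqrt(var_gap * var_l),
    }

tokens = "come and ride come and jump come Boots said Ted".split()
stats = summary(tokens)
print(stats["mean_R"], stats["mean_L"])  # equal up to rounding
```

On this toy text the two means coincide, as the text says they must, while the variances and correlations are free to differ. The sketch needs at least some variation in both entropies, otherwise the correlations are undefined.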
These measures of directional diversity apparently reflect that the 
language is a unidirectional process. This is to be expected in a 
suffixing language such as English. We wonder if some directional asymmetry 
is a property of all natural languages. 
Text Specific Compounds 
One purpose of this paper was to clarify the distributional 
nature of the word. The assumption has been that a word is a cluster 
of morphemes. A quantification of what one might mean by "cluster of 
morphemes" leads to the segmentation rules, and we have presented the 
results of their application in numerical detail. 
The hypothesis that words are clusters of morphemes according to 
our interpretation is partially verified by the data that have been 
presented, but the results remain suggestive rather than definitive. 
Printers' words and distributional groupings are coextensive with a 
much greater-than-chance frequency. Moreover, in one case at least, 
there is a close correspondence between the degree of distributional 
separation of morphemes and the corresponding syntactic boundaries. 
An ofttimes unstated assumption in statistical studies of language 
is that the results would become better if the sample size were larger. 
This assumption is confirmed, but only in a restricted sense. In 
the specialized language of the primer Ted and Sally, we used a large 
sample procedure to eliminate zero-degree segments and obtain a 
closer correspondence with printers' words. This procedure is applicable 
to the closed vocabulary of this primer, in which every morpheme is used 
many times. It would not be applicable to texts where Zipf's law holds 
and most morphemes are used only once. 
A study of the relationship between segmentation and sample size 
shows that segments are quite stable and do not change with respect to 
longer and longer portions of a text. In some cases, of course, larger 
samples break up segments which occurred initially for lack of distributional 
information. The general conclusion is that the distributional 
freedom with respect to limited contexts may be established from relatively 
small samples. 
With regard to establishing the distributional reality of printers' 
words, morpheme segments of fixed order do not necessarily approach 
words as the sample size increases. The distributional clusters which 
do not correspond to printers' words furnish style indicators. Thus, we 
have the segments: lookTed, ... in Ted and Sally; onboard and 
onshore in Robinson Crusoe; and ... and thecaseof in Word and Object. 
These stylistic groupings show the same strong association that is found 
between the morphemes occurring within words. These groupings are not 
necessarily the most frequent in a sample. 
The groups onboard, thecaseof, etc. function as compounds in their 
respective texts. We may speculate about the role of morpheme frequency 
in the formation of compounds. To use our theory in a predictive sense, 
we would assert that morphemes showing strong association, in the sense we 
have defined it, operate as compounds. 
Our rules enable us to make statements about the relative ease of 
combination of linguistic units. We have already pointed out that in the 
Robinson Crusoe sample the, in the context the // ship, shows neutral 
association, while in the context the //// first, the disassociation is 
strong. A parallel example, also in Robinson Crusoe, is on, where we 
find onboard, onshore. On the other hand, in the context of the 
prepositional phrases on us and on them, we find the neutral associations 
on // we and on // they. 
These examples suggest that there are degrees of distributional 
freedom and that instead of hoping to give an absolute distributional 
characterization of the word, we should speak of degrees of distributional 
word-hood. The degree of boundedness of the morphemes of a word is not 
an absolute property but depends on the corpus containing them, and in 
addition the context of surrounding morphemes. 
Graphemic Grouping 
The segmentation rules are numerical procedures for grouping 
linguistic units. Here we apply these rules to graphemic data. For a 
graphemic application we compare Ted and Sally and Word and Object. 
Using letters, we can process much larger samples than we could using 
morphemes. Relative to the first 16,640 letters of Ted and Sally, we 
obtain the segments 
Come Boots said Ted 
In this simple text almost all words can be isolated from letter samples. 
In contrast, consider the sentence fragment from Word and Object: 
What counts as a word as against a string ... 
Relative to a sample of 15,889 letters, the second-order segments from 
maxima in R are: 
What counts asa word asa gain stas tring ... 
From maxima in L: 
Wh at counts asaw ord asaga ins tast ring ... 
Combining the information from the R's and L's we obtain the segments: 
Whatcounts asa word asaga insta string ... 
The complexity of the text makes a marked difference in the operation 
of our segmentation rule. We obtain many word boundaries but also 
groupings such as insta. In a text of one-syllable words such as Ted and 
Sally, such combinations do not occur. 
A text intermediate to the last two is the lower school reader All 
Around Me. The segmentation relative to 15,760 letters shows an isolation 
of meaningful letter sequences, which are not necessarily words. The text 
begins: 
Now Whitey was eleven years old, or thereabouts. He had ... 
-~"s~n~i~n r~ ~vesi " ' ~ .... '~' .... ~: 
NOW W~'..te~a ~ ~leven .y,ear s. oldor there about she had ... 
This text illustrates that the segmentation ~'i~8 diStribUtion, 
giving X, 2, she .... ,":as s~ents\[ ::No p~ctuation was involved in 
r~h~~-t~ ~ ;~u%~ n:ot; ~h'~e-t~e '6~Se~ 'groupings ~' sUb~ant~y for 

REFERENCES 

Bloomfield, Leonard, "A Set of Postulates for the Science of Language," 
Language, 2, 1926, pp. 153-164. 

Bloomfield, Leonard, Language, New York, 1933, p. 178. 

Chomsky, Noam, Word, 12, 1958, p. 217. 

Gammon, E., Proc. IX Inter. Cong. of Ling., 1963, pp. 507-13. 

Greenberg, Joseph, Essays in Linguistics, Chicago, 1957, p. 27. 

Harris, James, Hermes or a Philosophical Inquiry Concerning Universal 
Grammar, London, 1771, pp. 20-21. 

Harris, Zellig, "From Phoneme to Morpheme," Language, 31, 1955, pp. 190-222. 

Hjelmslev, L., Omkring Sprogteoriens Grundlæggelse, 1943, p. 66. 

Jakobson, R., Actes d~ IV me C~ngre8 de Linaulstes, 193~, pp. 133-3~. 

Johnson, N. F., "The Psychological Reality of Phrase-Structure Rules," 
J. of Verbal Learning and Verbal Behavior, 4, 1965, pp. 469-475. 

Juilland, A. The Word, unpublished MS 

Kendall, M.G., The Advanced Theory of Statistics. Vol. l, New York, 
1952, p. 290 et seq. 

Khinchin, A. I., Mathematical Foundations of Information Theory, New York, 
1957, pp. 2-~. 

Togeby, Knud, "Qu'est-ce qu'un mot?" Travaux du Cercle Linguistique de 
Copenhague, V, 1949, pp. 97-111. 

Defoe, Daniel, "Robinson Crusoe," Beacon Third Reader, Ginn and Co., 
Boston, 1914. 

Francis, N., The Structure of American English, New York, 1959. 

Gates, A. I., and Bartlett, M. M., All Around Me, New York, 1957. 

Gates, A. I., Haber, M. B., and Salisbury, F. S., Ted and Sally, New 
York, 1957. 

Quine, W., Word and Object, M.I.T., 1960, pp. 13-14. 
