MODULARITY IN A CONNECTIONIST MODEL OF
MORPHOLOGY ACQUISITION
Michael Gasser
Departments of Computer Science and Linguistics
Indiana University
Abstract 
This paper describes a modular connectionist model of the
acquisition of receptive inflectional morphology. The model takes
inputs in the form of phones one at a time and outputs the
associated roots and inflections. In its simplest version, the
network consists of separate simple recurrent subnetworks for root
and inflection identification; both networks take the phone
sequence as inputs. It is shown that the performance of the two
separate modular networks is superior to that of a single network
responsible for both root and inflection identification. In a more
elaborate version of the model, the network learns to use separate
hidden-layer modules to solve the separate tasks of root and
inflection identification.
INTRODUCTION 
For many natural languages, the complexity of bound morphology
makes it a potentially challenging problem for a learning system,
whether human or machine. A language learner must acquire both the
ability to map polymorphemic words onto the sets of semantic
elements they represent and the ability to map meanings onto
polymorphemic words. Unlike previous work on connectionist
morphology (e.g., MacWhinney & Leinbach (1991), Plunkett &
Marchman (1991) and Rumelhart & McClelland (1986)), the focus of
this paper is receptive morphology, which represents the more
fundamental, or at least the earlier, process, one which
productive morphology presumably builds on.
The task of learning receptive morphology is viewed here as
follows. The learner is "trained" on pairs of forms, consisting of
sequences of phones, and "meanings", consisting of sets of roots
and inflections. I will refer to the task as root and inflection
identification. Generalization is tested by presenting the learner
with words consisting of novel combinations of familiar morphemes.
If the rule in question has been acquired, the learner is able to
identify the root and inflections in the test word.
Of interest is whether a model is capable of acquiring rules of
all of the types known for natural languages. This paper describes
a psychologically motivated connectionist model (Modular
Connectionist Network for the Acquisition of Morphology, MCNAM)
which approaches this level of performance. The emphasis here is
on the role of modularity at the level of root and inflection in
the model. I show how this sort of modularity improves performance
dramatically and consider how a network might learn to use modules
it is provided with. A separate paper (Gasser, 1994) looks in
detail at the model's performance for particular categories of
morphology, in particular, template morphology and reduplication.
The paper is organized as follows. I first provide a brief
overview of the categories of morphological rules found in the
world's languages. I then present a simple version of the model
and discuss simulations which demonstrate that it generalizes for
most kinds of morphological rules. I then describe a version of
the model augmented with modularity at the level of root and
inflection which generalizes significantly better and show why
this appears to be the case. Finally, I describe some tentative
attempts to develop a model which is provided with modules and
learns how to use them to solve the morphology identification
tasks it is faced with.
CATEGORIES OF 
MORPHOLOGICAL PROCESSES 
I will be discussing morphology in terms of the traditional
categories of "root" and "inflection" and morphological processes
in terms of "rules", though it should be emphasized that a
language learner does not have direct access to these notions, and
it is an open question whether they need to be an explicit part of
the system which the learner develops, let alone the device which
the learner starts out with. I will not make a distinction between
inflectional and derivational morphology (using "inflection" for
both) and will not consider compounding.
Affixation involves the addition of the inflection to the root (or
stem), either before (prefixation), after (suffixation), within
(infixation), or both before and after (circumfixation) the root.
A further type of morphological rule, which I will refer to as
mutation, consists in modification to the root segments
themselves. A third type of rule, familiar in Semitic languages,
is known as template morphology. Here a word (or stem) consists of
a root and a pattern of segments which are intercalated between
the root segments in a way which is specified within the pattern.
A fourth type, the rarest of all, consists in the deletion of one
or more segments. A fifth type, like affixation, involves the
addition of something to the root form. But the form of what is
added in this case is a copy, or a systematically altered copy, of
some portion of the root. This process, reduplication, is in one
way the most complex type of morphology (though it may not
necessarily be the most difficult for a child to learn) because it
seems to require a variable. It is not handled by the model
discussed in this paper. Gasser (1994) discusses the modification
of the model which is required to accommodate reduplication.
THE MODEL 
The approach to language acquisition exemplified in this paper
differs from traditional symbolic approaches in that the focus is
on specifying the sort of cognitive architecture and the sort of
general processing and learning mechanisms which have the capacity
to learn some aspect of language, rather than the innate knowledge
which this might require. If successful, such a model would
provide a simpler account of the acquisition of morphology than
one which begins with symbolic knowledge and constraints.
Connectionist models are interesting in this regard because of
their powerful sub-symbolic learning algorithms. But in the past,
there has been relatively little interest in investigating the
effect on the language acquisition capacity of structuring
networks in particular ways. The concern in this paper will be
with what is gained by adding modularity to a network.
Given the basic problem of what it means to learn receptive
morphology, I will begin with one of the simplest networks that
could have that capacity and then augment the device as necessary.
In this paper, two versions of the model are described. Version 1
successfully learns simple examples of all of the morphological
rules except reduplication and circumfixation, but its performance
is far from the level that might be expected from a human language
learner. Version 2 (MCNAM proper) incorporates a form of built-in
modularity which separates portions of the network responsible for
the identification of the root and the inflections; this improves
the network's performance significantly on all of the rule types
except reduplication, which cannot be learned even by a network
outfitted with this form of modularity.
Word recognition is an incremental process. Words are often
recognized long before they finish; hearers seem to be
continuously comparing the contents of a linguistic short-term
memory with the phonological representations in their mental
lexicons (Marslen-Wilson & Tyler, 1980). Thus the task at hand
requires a short-term memory of some sort. There are several ways
of representing short-term memory in connectionist networks (Port,
1990), in particular, through the use of time-delay connections
out of input units and through the use of recurrent time-delay
connections on some of the network units. The most flexible
approach makes use of recurrent connections on hidden units,
though the arguments in favor of this option are beyond the scope
of this paper. The model to be described here is a network of this
type, a version of the simple recurrent network due to Elman
(1990).
Version 1 
The Version 1 network is shown in Figure 1. Each box represents a
layer of connectionist processing units and each arrow a complete
set of weighted connections between two layers. The network
operates as follows. A sequence of phones is presented to the
input layer one at a time. That is, each tick of the network's
clock represents the presentation of a single phone. Each phone
unit represents a phonetic feature, and each word consists of a
sequence of phones preceded by a boundary "phone" made up of 0.0
activations.
Figure 1: Network for Acquisition of Morphology (Version 1)
An input phone pattern sends activation to the network's hidden
layer. The hidden layer also receives activation from the pattern
that appeared there on the previous time step. Thus each hidden
unit is joined by a time-delay connection to each other hidden
unit. It is the previous hidden-layer pattern which represents the
system's short-term memory. Because the hidden layer has access to
this previous state, which in turn depended on its state at the
time step before that, there is no absolute limit to the length of
the context stored in the short-term memory. At the beginning of
each word sequence, the hidden layer is reinitialized to a pattern
consisting of 0.0 activations.
Finally the output units are activated by the hidden layer. There
are three output layers. One represents simply a copy of the
current input phone. Training the network to auto-associate its
current input aids in learning the root and inflection
identification task because it forces the network to learn to
distinguish the individual phones at the hidden layer, a
prerequisite to using the short-term memory effectively. The
second layer of output units represents the root "meaning". For
each root there is a single output unit. Thus while there is no
real semantics, the association between the input phone sequence
and the "meaning" is at least an arbitrary one. The third group of
output units represents the inflection "meaning". Again there is a
unit for each separate inflection.
For each input phone, the network receives a target consisting of
the correct phone, root, and inflection outputs for the current
word. The phone target is identical to the input phone. The root
and inflection targets, which are constant throughout the
presentation of a word, are the patterns associated with the root
and inflection for the input word.
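The processing cycle described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the original implementation: the class name `SimpleRecurrentNet`, the logistic activation, the absence of bias terms, the weight range, and the choice of 8 phonetic features in the usage note are all hypothetical, since the paper does not specify them. Training by backpropagation is omitted; the sketch shows only the forward pass.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(weights, vector):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def random_matrix(rows, cols, rng, scale=0.5):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

class SimpleRecurrentNet:
    """Hypothetical Version 1 layout: one recurrent hidden layer, three output layers."""

    def __init__(self, n_features, n_hidden, n_roots, n_inflections, seed=0):
        rng = random.Random(seed)
        self.n_features = n_features
        self.n_hidden = n_hidden
        self.w_in = random_matrix(n_hidden, n_features, rng)   # input -> hidden
        self.w_rec = random_matrix(n_hidden, n_hidden, rng)    # previous hidden -> hidden
        self.w_out = {                                         # hidden -> each output layer
            "phone": random_matrix(n_features, n_hidden, rng),
            "root": random_matrix(n_roots, n_hidden, rng),
            "inflection": random_matrix(n_inflections, n_hidden, rng),
        }

    def run_word(self, phones):
        """Present a boundary pattern, then one phone per time step; return all outputs."""
        hidden = [0.0] * self.n_hidden        # hidden layer reset at the start of a word
        boundary = [0.0] * self.n_features    # boundary "phone" of 0.0 activations
        outputs = []
        for phone in [boundary] + list(phones):
            net = [a + b for a, b in zip(matvec(self.w_in, phone),
                                         matvec(self.w_rec, hidden))]
            hidden = [sigmoid(x) for x in net]
            outputs.append({name: [sigmoid(x) for x in matvec(w, hidden)]
                            for name, w in self.w_out.items()})
        return outputs
```

For a word of three phones, `SimpleRecurrentNet(8, 30, 30, 2).run_word(word)` returns four output records (boundary plus three phones); the last record corresponds to the end of the word sequence.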
The network is trained using the backpropagation
learning algorithm (Rumelhart, Hinton, &
Williams, 1986), which adjusts the weights on all 
of the network's connections in such a way as to 
minimize the error, that is, the difference between 
the network's outputs and the targets. For each 
morphological rule, a separate network is trained 
on a subset of the possible combinations of root 
and inflection. At various points during training, 
the network is tested on unfamiliar words, that is, 
novel combinations of roots and inflections. The 
performance of the network is the percentage of the 
test roots and inflections for which its output is cor- 
rect at the end of each word sequence when it has 
enough information to identify both root and inflection.
A "correct" output is one which is closer
to the appropriate target than to any of the others.
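The scoring criterion just described can be illustrated with a short sketch. It assumes that each target is a one-unit-on pattern (one output unit per root or inflection, as stated earlier) and that "closer" means smaller squared Euclidean distance; the distance measure and the function names `is_correct` and `performance` are assumptions made for illustration.

```python
def one_hot(index, size):
    """Target pattern with a single active unit."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def is_correct(output, target_index):
    """True if the output is strictly closer to its own target than to any other."""
    n = len(output)
    distances = [squared_distance(output, one_hot(i, n)) for i in range(n)]
    own = distances[target_index]
    return all(own < d for i, d in enumerate(distances) if i != target_index)

def performance(final_outputs, target_indices):
    """Proportion of test items whose end-of-word output is correct."""
    hits = sum(is_correct(o, t) for o, t in zip(final_outputs, target_indices))
    return hits / len(final_outputs)
```

Under this reading, guessing among 30 equally likely roots gives an expected score of 1/30, and guessing between 2 inflections gives 1/2.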
In all of the experiments reported on here, the stimuli presented
to the network consisted of words in an artificial language. The
phoneme inventory of the language was made up of 19 phones (24 for
the mutation rule, which nasalizes vowels). For each morphological
rule, there were 30 roots, 15 each of CVC and CVCVC patterns of
phones. Each word consisted of two morphemes, a root and a single
"tense" inflection, marking the "present" or "past". Examples of
each rule: (1) suffix: present-vibuni, past-vibuna; (2) prefix:
present-ivibun, past-avibun; (3) infix: present-vikbun,
past-vinbun; (4) circumfix: present-ivibuni, past-avibuna; (5)
mutation: present-vibun, past-vibũn; (6) deletion: present-vibun,
past-vibu; (7) template: present-vaban, past-vbaan.
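The example forms listed above can be reproduced with a small set of string functions. This is a sketch based only on the single example given for each rule: the position of the infix (after the first consonant-vowel pair) and the shape of the two templates are inferred from those examples and are assumptions, and mutation is left out because it operates on phonetic features rather than on letters.

```python
AFFIX_VOWEL = {"present": "i", "past": "a"}
INFIX = {"present": "k", "past": "n"}

def suffix(root, tense):
    return root + AFFIX_VOWEL[tense]

def prefix(root, tense):
    return AFFIX_VOWEL[tense] + root

def infix(root, tense):
    # Assumption: the infix follows the first CV of the root (vi-k-bun).
    return root[:2] + INFIX[tense] + root[2:]

def circumfix(root, tense):
    vowel = AFFIX_VOWEL[tense]
    return vowel + root + vowel

def deletion(root, tense):
    # The past form drops the final segment; the present form is the bare root.
    return root if tense == "present" else root[:-1]

def template(consonants, tense):
    # Assumption: three root consonants intercalated as C1aC2aC3 or C1C2aaC3.
    c1, c2, c3 = consonants
    if tense == "present":
        return c1 + "a" + c2 + "a" + c3
    return c1 + c2 + "aa" + c3
```

For example, `infix("vibun", "past")` gives `vinbun` and `template("vbn", "past")` gives `vbaan`, matching the forms in the text.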
For each morphological rule there were 60 (30
roots x 2 inflections) different words. From these
40 were selected randomly as training words, and
the remaining 20 were set aside as test words. For
each rule, ten separate networks, with different ran- 
dom initial weights, were trained for 150 epochs 
(repetitions of all training patterns). Every 95 
epochs, the performance of the network on the test 
patterns was assessed. 
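The division of the 60 words into training and test sets might be set up as follows. This is a minimal sketch: the function name `split_words`, the seed, and the placeholder root labels in the usage note are hypothetical, and the paper does not say whether the random selection was constrained in any way (for instance, so that every root occurs in at least one training word).

```python
import random
from itertools import product

def split_words(roots, inflections, n_train=40, seed=0):
    """Randomly divide all root x inflection combinations into training and test words."""
    words = list(product(roots, inflections))   # 30 x 2 = 60 (root, inflection) pairs
    rng = random.Random(seed)
    rng.shuffle(words)
    return words[:n_train], words[n_train:]
```

With 30 placeholder roots, `split_words(["r%d" % i for i in range(30)], ["present", "past"])` yields 40 training pairs and 20 test pairs with no overlap.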
Figure 2 shows the performance of the Version 1 network on each
rule (as well as performance on Version 2, to be described below).
Note that chance performance for the roots was .033 and for the
inflections .5 since there were 30 roots and 2 inflections. There
are several things to notice in these results. Except for root
identification for the circumfix rule, the network performs well
above chance. However, the results are still disappointing in many
cases. In particular, note the poor performance on root
identification for the prefix rule and inflection identification
for the suffix rule. The behavior is much poorer than we might
expect from a child learning these relatively simple rules.
The problem, it turns out, is interference between the two tasks
which the network is faced with. On the one hand, it must pay
attention to information which is relevant to root identification,
on the other, to information relevant to inflection
identification. This means making use of the network's short-term
memory in very different ways. Consider the prefixing case, for
example. Here for inflection identification, the network need only
pay attention to the first phone and then remember it until the
end of the sequence is reached, ignoring all of the phones which
appear in between. For root identification, however, the network
does best if it ignores the initial phone in the sequence and then
pays careful attention to each of the following phones.
Ideally the network's hidden layer would divide into modules, one
dedicated to root identification, the other to inflection
identification. This could happen if some of the recurrent
hidden-unit weights and some of the weights on hidden-to-output
connections went to 0. However, ordinary backpropagation tends to
implement sharing among hidden-layer units: each hidden-layer unit
participates to some extent in activating all output units. When
there are conflicting output tasks, as in this case, there are two
sorts of possible consequences: either performance on both tasks
is mediocre, or the simpler task comes to dominate the hidden
layer, yielding good performance on that task and poor performance
on the other. In the Version 1 results shown in Figure 2, we see
both sorts of outcomes. What is apparently needed is modularity at
the hidden-layer level. One sort of modularity is hard-wired into
the network's architecture in Version 2 of the model, described in
the next section.
Version 2 
Because root and inflection identification make conflicting
demands on the network's short-term memory, it is predicted that
performance will improve with separate hidden layers for the two
tasks. Various degrees of modularity are possible in connectionist
networks; the form implemented in Version 2 of the model is total
modularity, completely separate networks for the two tasks. This
is shown in Figure 3. There are now two hidden-layer modules, each
with recurrent connections only to units within the same module
and with connections to one of the two output identification
layers of units. (Both hidden layers connect to the
auto-associative phone output layer.)
Figure 2: Performance on Test Words Following Training (Network
Versions 1 and 2). [Bar chart not reproduced: results for the
Suffix, Prefix, Infix, Circumfix, Delete, Mutate, and Template
rules, with bars for Root V.1, Root V.2, Inflection V.1, and
Inflection V.2 and chance levels marked for roots and inflections.]
Figure 3: Network for Acquisition of Morphology (Version 2)
The same stimuli were used in training and testing the Version 2
network as the Version 1 network. Each Version 2 network had the
same number of total hidden units as each Version 1 network, 30.
Each hidden-layer module contained 15 units. Note that this means
there are fewer connections in the Version 2 than the Version 1
networks. Investigations with networks with hidden layers of
different sizes indicate that, if anything, this should favor the
Version 1 networks.
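The claim that the modular network has fewer connections can be checked with simple arithmetic. The sketch below counts weights for both layouts under stated assumptions: bias terms are ignored, the number of phonetic feature units (8 in the usage note) is hypothetical because the paper does not give it, and both modules are assumed to receive the full input and to feed the phone output layer, as described for Version 2.

```python
def count_connections(n_features, n_roots, n_inflections, n_hidden=30, modular=False):
    """Count weighted connections (biases ignored) in a Version 1 or Version 2 layout."""
    input_to_hidden = n_features * n_hidden     # every hidden unit sees the whole input
    hidden_to_phone = n_hidden * n_features     # every hidden unit feeds the phone layer
    if modular:
        half = n_hidden // 2
        recurrent = 2 * half * half             # recurrence only within each module
        hidden_to_meaning = half * n_roots + half * n_inflections
    else:
        recurrent = n_hidden * n_hidden         # full hidden-to-hidden recurrence
        hidden_to_meaning = n_hidden * (n_roots + n_inflections)
    return input_to_hidden + recurrent + hidden_to_phone + hidden_to_meaning
```

With 30 roots and 2 inflections, the recurrent weights drop from 900 to 450 and the hidden-to-meaning weights from 960 to 480, so the modular layout has 930 fewer connections whatever the number of input features.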
Figure 2 compares results from the two versions following 150
epochs of training. For all of the rule types, modularity improves
performance for both root and inflection identification.
Obviously, hidden-layer modularity results in diminished
interference between the two output tasks. Performance is still
far from perfect for some of the rule types, but further
improvement is possible with optimization of the learning
parameters.
TOWARDS ADAPTIVE 
MODULARITY 
It is important to be clear on the nature of the modularity being
proposed here. As discussed above, I have defined the task of word
recognition in such a way that there is a built-in distinction
between lexical and grammatical "meanings" because these are
localized in separate output layers. The modular architecture of
Figure 3 extends this distinction into the domain of phonology.
That is, the shape of words is represented internally (on the
hidden layer) in terms of two distinct patterns, one for the root
and one for the inflection, and the network "knows" this even
before it is trained, though of course it does not know how the
root and inflections will be realized in the language.
A further concern arises when we consider what happens when more
than one grammatical category is represented in the words being
recognized, for example, aspect in addition to tense on verbs.
Assuming the hidden-layer modules are a part of the innate makeup
of the learning device, this means that a fixed number of given
modules must be divided up among the separate output "tasks" which
the target language presents. Ideally, the network would have the
capacity to figure out for itself how to distribute the modules it
starts with among the various output tasks; I return to this
possibility below. But it is also informative to investigate what
sort of a sharing arrangement achieves the best performance. For
example, given two modules and three output tasks, root
identification and the identification of two separate inflections,
which of the three possible ways of sharing the modules achieves
the best performance?
Two sets of experiments were conducted to investigate the optimal
use of fixed modules by a network, one designed to determine the
best way of distributing modules among output tasks when the
number of modules does not match the number of output tasks and
one designed to determine whether a network could assign the
modules to the tasks itself. In both sets of experiments, the
stimuli were words composed of a stem and two affixes, either two
suffixes, two prefixes, or one prefix and one suffix. (All of
these possibilities occur in natural languages.) The roots were
the same ones used in the affixation and deletion experiments
already reported. In the two-suffix case, the first suffix was /a/
or /i/, the second suffix /s/ or /k/. Thus the four forms for the
root migon were migonik, migonis, migonak, and migonas. In the
two-prefix case the prefixes were /s/ or /k/ and /a/ or /i/. In
the prefix-suffix case, the prefix was /u/ or /e/ and the suffix
/a/ or /i/. There were in all cases two hidden-layer modules. The
size of the modules was such that the root identification task had
potentially 20 units and each of the inflection identification
tasks potentially 3 units at its disposal; the sum of the units in
the two modules was always 26.
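The three sharing arrangements and the module sizes they imply can be enumerated directly. The sketch below reflects one reading of the unit budget stated above, namely that a module's size is the sum of the budgets of the tasks assigned to it (20 for the root, 3 for each affix); the task labels and function names are hypothetical.

```python
UNIT_BUDGET = {"root": 20, "affix1": 3, "affix2": 3}

def configurations(tasks=("root", "affix1", "affix2")):
    """Each way of giving one task its own module while the other two share one."""
    configs = []
    for alone in tasks:
        shared = tuple(t for t in tasks if t != alone)
        configs.append((shared, (alone,)))
    return configs

def module_sizes(config):
    """Size of each module, taken as the summed unit budgets of its tasks."""
    return tuple(sum(UNIT_BUDGET[t] for t in module) for module in config)

for config in configurations():
    print(config, module_sizes(config))
```

On this reading, the arrangement in which the affixes share a module gives modules of 6 and 20 units, and the two arrangements in which the root shares a module with one affix give 23 and 3 units; in every case the total is 26.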
The results are only summarized here. The configuration in which a
single module is shared by the two affix-identification tasks is
consistently superior for performance on root identification but
only superior for affix identification in the two-suffix case. For
the prefix-suffix case, the configuration in which one module is
shared by root identification and suffix identification is clearly
inferior to the other two configurations for performance on suffix
identification. For the two-prefix case, the configurations make
little difference for performance on identification of either of
the prefixes. Note that the results for the two-prefix and
two-suffix cases agree with those for the single-prefix and
single-suffix cases respectively (Figure 2).
What the results for root identification make clear is that, even
though the affix identification tasks are easily learned with only
3 units, when they are provided with more units (23 in these
experiments), they will tend to "distribute" themselves over the
available units. If this were not the case, performance on the
competing, and more difficult, task, root identification, would be
no better when it has 20 units to itself than when it shares 23
units with one of the other two tasks.
We conclude that the division of labor into separate root and
inflection identification modules works best, primarily because it
reduces interference with root identification, but also for the
two-suffix case, and to a lesser extent for the prefix-suffix
case, because it improves performance on affix identification. If
one distribution of the available modules is more efficient than
the others, we would like the network to be able to find this
distribution on its own. Otherwise it would have to be wired into
the system from the start, and this would require knowing that the
different inflection tasks belong to the same category. Some form
of adaptive use of the available modules seems called for.
Given a system with a fixed set of modules but no wired-in
constraints on how they are used to solve the various output
tasks, can a network organize itself in such a way that it uses
the modules efficiently? There has been considerable interest in
the last few years in architectures which are endowed with
modularity and learn to use the modularity to solve tasks which
call for it. The architecture described by Jacobs, Jordan, & Barto
(1991) is an example. In this approach there are connections from
each modular hidden layer to all of the output units. In addition
there are one or more gating networks whose function is to
modulate the input to the output units from the hidden-layer
modules. In the version of the architecture which is appropriate
for domains such as the current one, there is a single gating unit
responsible for the set of connections from each hidden module to
each output task group. The outputs of the modules are weighted by
the outputs of the corresponding gating units to give the output
of the entire system. The whole network is trained using
backpropagation. For each of the modules, the error is weighted by
the value of the gating input as it is passed back to the modules.
Thus each module adjusts its weights in such a way that the
difference between the system's output and the desired target is
minimized, and the extent to which a module's weights are changed
depends on its contribution to the output. For the gating
networks, the error function implements competition among the
modules for each output task group.
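The gating scheme described above can be sketched for a single output task group. This is a simplified illustration of the verbal description only: it shows the gate-weighted combination of module outputs and the gate-weighted error passed back to each module, and it leaves out the competitive error term for the gating units, whose exact form is not given here. The function names are hypothetical.

```python
def gated_output(module_outputs, gates):
    """System output for one task group: module outputs weighted by their gates.

    module_outputs: one output vector per module for this task group.
    gates: one gating value per module for this task group.
    """
    size = len(module_outputs[0])
    return [sum(g * out[i] for g, out in zip(gates, module_outputs))
            for i in range(size)]

def module_errors(target, system_output, gates):
    """Error passed back to each module, scaled by that module's gating value."""
    error = [t - y for t, y in zip(target, system_output)]
    return [[g * e for e in error] for g in gates]
```

A module whose gate for a task group is near 0 therefore receives almost no error from that group and changes its weights very little on its account, which is what allows modules to specialize.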
For our purposes, two further augmentations are required. First,
we are dealing with recurrent networks, so we permit each of the
modular hidden layers to see its own previous values in addition
to the current input, but not the previous values of the hidden
layers of the other modules. Second, we are interested not only in
competition among the modules for the output groups, but also in
competition among the output groups for the modules. In
particular, we would like to prevent the network from assigning a
single module to all output tasks. To achieve this, the error
function is modified so that error is minimized, all else being
equal, when the total of the outputs of all gating units dedicated
to a single module is neither close to 0.0 nor close to the total
number of output groups.
Figure 4 shows the architecture for the situation in which there
is only one inflection to be learned. (The auto-associative phone
output layer is not shown.) The connections ending in circles
symbolize the competition between sets of gating units which is
built into the error function for the network. Note that the
gating units have no input connections. These units have only to
learn a bias, which, once the system is stable, leads to a
relatively constant output. The assumption is that, since we are
dealing with a spatial crosstalk problem, the way in which
particular modules are assigned to particular tasks should not
vary with the input to the network.
Figure 4: Adaptive Modular Architecture for Morphology Acquisition
An initial experiment demonstrated that the adaptive modular
network consistently assigned separate modules to the output tasks
when there were two modules and two tasks (identification of the
root and a single inflection).
Next a set of experiments tested whether the adaptive modular
architecture would assign two modules to three tasks (root and two
inflections) in the most efficient way for the two-suffix,
two-prefix, and prefix-suffix cases. Recall that the most
efficient pattern of connectivity in all cases was the one in
which one of the two modules was shared by the two affix
identification tasks.
Adaptive modular networks with two modules of 15 units each were
trained on the two-suffix, two-prefix, and prefix-suffix tasks
described in the last section. Following 120 epochs, the outputs
of the six gating units for the different modules were examined to
determine how the modules were shared.
The results were completely negative; the three possible ways of
assigning the modules to the three identification tasks occurred
with approximately equal frequency. The problem was that the
inflection identification tasks were so much easier than the root
identification task that they claimed the two modules for
themselves early on, while neither module was strongly preferred
by the root task. Thus as often as not, the two inflections ended
up assigned to different modules. To compensate for this, then, is
it reasonable to give root identification some sort of advantage
over inflection identification? It is well known that children
begin to acquire lexical morphemes before they acquire grammatical
morphemes. Among the reasons for this is probably the more
abstract nature of the meanings of the grammatical morphemes. In
terms of the network's tasks, this relative difficulty would
translate into an inability to know what the inflection targets
would be for particular input patterns. Thus we could model it by
delaying training on the inflection identification task.
The experiment with the adaptive modular networks was repeated,
this time with the following training regimen. Entire words
(consisting of root and two affixes) were presented throughout
training, but for the first 80 epochs, the network saw targets for
only the root identification task. That is, the connections into
the output units for the two inflections were not altered during
this phase. Following the 80th epoch, by which time the network
was well on its way to learning the roots, training on the
inflections was introduced. This procedure was followed for the
two-suffix, two-prefix, and prefix-suffix tasks; 20 separate
networks were trained for each type. For the two-suffix task, in
all cases the network organized itself in the predicted way. That
is, for all 20 networks one of the modules was associated mainly
with the two inflection output units and the other associated with
the root output units. In the prefix-suffix case, however, the
results were more equivocal. Only 12 out of 20 of the networks
organized themselves in such a way that the two inflection tasks
were shared by one module, while in the 8 other cases, one module
was shared by the root and prefix identification tasks. Finally,
in the two-prefix case, all of the networks organized themselves
in such a way that the root and the first prefix shared a module
rather than in the apparently more efficient configuration.
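The delayed training regimen used in this experiment can be expressed as a simple schedule over output groups. This is a sketch under assumptions: epochs are counted from 1, the group labels are hypothetical, and the auto-associative phone output is assumed to be trained throughout, which the text does not state explicitly.

```python
def active_targets(epoch, delay=80):
    """Output groups that receive targets at a given epoch (epochs counted from 1)."""
    if epoch <= delay:
        # Assumption: phone auto-association continues during the root-only phase.
        return ("phone", "root")
    return ("phone", "root", "inflection1", "inflection2")

def masked_errors(errors, epoch, delay=80):
    """Zero the error of any output group that is not yet being trained.

    errors: mapping from output group name to its error vector.
    """
    active = active_targets(epoch, delay)
    return {group: (list(err) if group in active else [0.0] * len(err))
            for group, err in errors.items()}
```

Zeroing the error for the inflection groups during the first phase means that no weight changes are driven by those outputs, so the connections into the inflection units stay as they were until the 81st epoch.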
The difference is not surprising when we consider the nature of
the advantage of the configuration in which the two inflection
identification tasks share one module. For all three types of
affixes, roots are identified better with this configuration. But
this will have little effect on the way the network organizes
itself because, following the 80th epoch when competition among
the three output tasks is introduced, one or the other of the
modules will already be firmly linked to the root output layer. At
this point, the outcome will depend mainly on the competition
between the two inflection identification tasks for the two
modules, the one already claimed for root identification and the
one which is still unused. Thus we can expect this training
regimen to settle on the best configuration only when it makes a
significant difference for inflection, as opposed to root,
identification. Since this difference was greater for the
two-suffix words than for the prefix-suffix words and virtually
non-existent for the two-prefix words, there is the greatest
preference in the two-suffix case for the configuration in which
the two inflection tasks share a single module. It is also of
interest that for the prefix-suffix case, the network never chose
to share one module between the root and the suffix; this is
easily the least efficient of the three configurations from the
perspective of inflection identification.
Thus we are left with only a partial solution to the problem of
how the modular architecture might arise in the first place. For
circumstances in which the different sorts of modularity impinge
on inflection identification, the adaptive approach can find the
right configuration. When it is performance on root identification
that makes the difference, however, this approach has nothing to
offer. Future work will also have to address what happens when
there are more than two modules and/or more than two inflections
in a word.
CONCLUSIONS 
Early work applying connectionist networks to high-level cognitive
tasks often seemed based on the assumption that a single network
would be able to handle a wide range of phenomena. Increasingly,
however, the emphasis is moving in the direction of
special-purpose modules for subtasks which may conflict with each
other if handled by the same hardware (Jacobs et al., 1991). These
approaches bring connectionist models somewhat more in line with
the symbolic models which they seek to replace. In this paper I
have shown how the ability of simple recurrent networks to extract
"structure in time" (Elman, 1990) is enhanced by built-in
modularity which permits the recurrent hidden-unit connections to
develop in ways which are suitable for the root and inflection
identification tasks. Note that this modularity does not amount to
endowing the network with the distinction between root and affix
because both modules take the entire sequence of phones as input,
and the modularity is the same when the rule being learned is one
for which there are no affixes at all (mutation, for example).
Modular approaches, whether symbolic or connectionist, inevitably
raise further questions, however. The modularity in the pre-wired
version of MCNAM, which is reminiscent of the traditional
separation of lexical and grammatical knowledge in linguistic
models, assumes that the division of "semantic" output units into
lexical and grammatical categories has already been made. The
adaptive version partially addresses this shortcoming, but it is
only effective in cases where modularity benefits inflection
identification. Furthermore, it is still based on the assumption
that the output is divided initially into groups representing
separate competing tasks. I am currently experimenting with
related adaptive approaches, as well as methods involving weight
decay and weight pruning, which treat each output unit as a
separate task.
References 
Elman, J. (1990). Finding structure in time. Cognitive Science,
14, 179-211.
Gasser, M. (1994). Acquiring receptive morphology: a connectionist
model. Annual Meeting of the Association for Computational
Linguistics, 32.
Jacobs, R. A., Jordan, M. I., & Barto, A. G. (1991). Task
decomposition through competition in a modular connectionist
architecture: the what and where vision tasks. Cognitive Science,
15, 219-250.
MacWhinney, B. & Leinbach, J. (1991). Implementations are not
conceptualizations: revising the verb learning model. Cognition,
40, 121-157.
Marslen-Wilson, W. D. & Tyler, L. K. (1980). The temporal
structure of spoken language understanding. Cognition, 8, 1-71.
Plunkett, K. & Marchman, V. (1991). U-shaped learning and
frequency effects in a multi-layered perceptron: implications for
child language acquisition. Cognition, 38, 1-60.
Port, R. (1990). Representation and recognition of temporal
patterns. Connection Science, 2, 151-176.
Rumelhart, D. E. & McClelland, J. L. (1986). On learning the past
tense of English verbs. In McClelland, J. L. & Rumelhart, D. E.
(Eds.), Parallel Distributed Processing, Volume 2, pp. 216-271.
MIT Press, Cambridge, MA.
Rumelhart, D. E., Hinton, G., & Williams, R. (1986). Learning
internal representations by error propagation. In Rumelhart, D. E.
& McClelland, J. L. (Eds.), Parallel Distributed Processing,
Volume 1, pp. 318-362. MIT Press, Cambridge, MA.
