Proceedings of the Second Workshop on Psychocomputational Models of Human Language Acquisition, pages 36–44,
Ann Arbor, June 2005. c©2005 Association for Computational Linguistics
A Connectionist Model of Language-Scene Interaction
Marshall R. Mayberry, III Matthew W. Crocker Pia Knoeferle
Department of Computational Linguistics
Saarland University
Saarbr¨ucken 66041, Germany
martym,crocker,knoferle@coli.uni-sb.de
Abstract
Recent “visual worlds” studies, wherein
researchers study language in context by
monitoring eye-movements in a visual
scene during sentence processing, have re-
vealed much about the interaction of di-
verse information sources and the time
course of their influence on comprehen-
sion. In this study, five experiments that
trade off scene context with a variety of
linguistic factors are modelled with a Sim-
ple Recurrent Network modified to inte-
grate a scene representation with the stan-
dard incremental input of a sentence. The
results show that the model captures the
qualitative behavior observed during the
experiments, while retaining the ability to
develop the correct interpretation in the
absence of visual input.
1 Introduction
People learn language within the context of the sur-
rounding world, and use it to refer to objects in that
world, as well as relationships among those objects
(e.g., Gleitman, 1990). Recent research in the vi-
sual worlds paradigm, wherein participants’ gazes
in a scene while listening to an utterance are moni-
tored, has yielded a number of insights into the time
course of sentence comprehension. The careful ma-
nipulation of information sources in this experimen-
tal setting has begun to reveal important character-
istics of comprehension such as incrementality and
anticipation. For example, people’s attention to ob-
jects in a scene closely tracks their mention in a spo-
ken sentence (Tanenhaus et al., 1995), and world
and linguistic knowledge seem to be factors that fa-
cilitate object identification (Altmann and Kamide,
1999; Kamide et al., 2003). More recently, Knoe-
ferle et al. (2005) have shown that when scenes in-
clude depicted events, such visual information helps
to establish important relations between the entities,
such as role relations.
Models of sentence comprehension to date, how-
ever, continue to focus on modelling reading behav-
ior. No model, to our knowledge, attempts to ac-
count for the use of immediate (non-linguistic) con-
text. In this paper we present results from two simu-
lations using a Simple Recurrent Network (SRN; El-
man, 1990) modified to integrate input from a scene
with the characteristic incremental processing of
such networks in order to model people’s ability to
adaptively use the contextual information in visual
scenes to more rapidly interpret and disambiguate a
sentence. In the modelling of five visual worlds ex-
periments reported here, accurate sentence interpre-
tation hinges on proper case-role assignment to sen-
tence referents. In particular, modelling is focussed
on the following aspects of sentence processing:
• anticipation of upcoming arguments and their
roles in a sentence
• adaptive use of the visual scene as context for a
spoken utterance
• influence of depicted events on developing in-
terpretation
• multiple/conflicting information sources and
their relative importance
36
Figure 1: Selectional Restrictions. Gaze fixations depend
on whether the hare is the subject or object of the sentence,
as well as the thematic role structure of the verb. These gaze
fixations reveal that people use linguistic and world knowledge
to anticipate upcoming arguments.
2 Simulation 1
In the first simulation, we simultaneously model
four experiments that featured revealing contrasts
between world knowledge and context. These four
experiments show that the human sentence proces-
sor is very adept at utilizing all available sources of
information to rapidly interpret language. In partic-
ular, information from visual context can readily be
integrated with linguistic and world knowledge to
disambiguate argument roles where the information
from the auditory stream is insufficient in itself.
All experiments were conducted in German, a
language that allows both subject-verb-object (SVO)
and object-verb-subject (OVS) sentence types, so
that word order alone cannot be relied upon to deter-
mine role assignments. Rather, case marking in Ger-
man is used to indicate grammatical function such
as subject or object, except in the case of feminine
and neuter nouns where the article does not carry
any distinguishing marking for the nominative and
accusative cases.
2.1 Anticipation depending on stereotypicality
The first two experiments modelled involved unam-
biguous sentences in which case-marking and verb
selectional restrictions in the linguistic input (i.e.,
linguistic and world knowledge or stereotypicality),
together with characters depicted in a visual scene,
allowed rapid assignment of the roles played by
those characters in the sentence.
Experiment 1: Morphosyntactic and lexical
verb information. In order to examine the influence
of linguistic knowledge of case-marking, Kamide
et al. (2003) presented experiment participants with
a scene showing, for example, a hare, a cabbage, a
fox, and a distractor (see Figure 1), together with ei-
ther a spoken German SVO sentence (1) or with an
OVS sentence (2):
(1) Der Hase frisst gleich den Kohl.
The harenom eats shortly the cabbageacc.
(2) Den Hasen frisst gleich der Fuchs.
The hareacc eats shortly the foxnom.
The subject and object case-marking on the article of
the first noun phrase (NP) together with verb mean-
ing and world knowledge allowed anticipation of the
correct post-verbal referent. Participants made an-
ticipatory eye-movements to the cabbage after hear-
ing “The harenom eats ...” and to the fox upon en-
countering “The hareacc eats ...”. Thus, people are
able to predict upcoming referents when the utter-
ance is unambiguous and linguistic/world knowl-
edge restricts the domain of potential referents in a
scene.
Experiment 2: Verb type information. To fur-
ther investigate the role of verb information, the
authors used the same visual scenes in a follow-
up study, but replaced the agent/patient verbs like
frisst (“eats”) with experiencer/theme verbs like in-
teressiert (“interests”). The agent (experiencer) and
patient (theme) roles from Experiment 1 were inter-
changed. Given the same scene in Figure 1 but the
subject-first sentence (3) or object-first sentence (4),
participants showed gaze fixations complementary
to those in the first experiment, confirming that both
syntactic case information and semantic verb infor-
mation are used to predict subsequent referents.
(3) Der Hase interessiert ganz besonders den Fuchs.
The harenom interests especially the foxacc.
(4) Den Hasen interessiert ganz besonders der Kohl.
The hareacc interests especially the cabbagenom.
2.2 Anticipation depending on depicted events
The second set of experiments investigated tem-
porarily ambiguous German sentences. Findings
showed that depicted events–just like world and lin-
guistic knowledge in unambiguous sentences–can
establish a scene character’s role as agent or patient
in the face of linguistic structural ambiguity.
37
Figure 2: Depicted Events. The depiction of actions allows
role information to be extracted from the scene. People can use
this information to anticipate upcoming arguments even in the
face of ambiguous linguistic input.
Experiment 3: Verb-mediated depicted role re-
lations. Knoeferle et al. (2005) investigated com-
prehension of spoken sentences with local structural
and thematic role ambiguity. An example of the Ger-
man SVO/OVS ambiguity is the SVO sentence (5)
versus the OVS sentence (6):
(5) Die Princessin malt offensichtlich den Fechter.
The princessnom paints obviously the fenceracc.
(6) Die Princessin w¨ascht offensichtlich der Pirat.
The princessacc washes obviously the piratenom.
Together with the auditorily presented sentence a
scene was shown in which a princess both paints
a fencer and is washed by a pirate (see Figure 2).
Linguistic disambiguation occurred on the second
NP; in the absence of stereotypical verb-argument
relationships, disambiguation prior to the second NP
was only possible through use of the depicted events
and their associated depicted role relations. When
the verb identified an action, the depicted role rela-
tions disambiguated towards either an SVO agent-
patient (5) or OVS patient-agent role (6) relation, as
indicated by anticipatory eye-movements to the pa-
tient (pirate) or agent (fencer), respectively, for (5)
and (6). This gaze-pattern showed the rapid influ-
ence of verb-mediated depicted events on the assign-
ment of a thematic role to a temporarily ambiguous
sentence-initial noun phrase.
Experiment 4: Weak temporal adverb con-
straint. Knoeferle et al. also investigated German
verb-final active/passive constructions. In both the
active future-tense (7) and the passive sentence (8),
the initial subject noun phrase is role-ambiguous,
and the auxiliary wird can have a passive or future
interpretation.
(7) Die Princessin wird sogleich den Pirat washen.
The princessnom will right away wash the pirateacc.
(8) Die Princessin wird soeben von dem Fechter gemalt.
The princessacc is just now painted by the fencernom.
To evoke early linguistic disambiguation, temporal
adverbs biased the auxiliary wird toward either the
future (“will”) or passive (“is -ed”) reading. Since
the verb was sentence-final, the interplay of scene
and linguistic cues (e.g., temporal adverbs) were
rather more subtle. When the listener heard a future-
biased adverb such as sogleich, after the auxiliary
wird, he interpreted the initial NP as an agent of a fu-
ture construction, as evidenced by anticipatory eye-
movements to the patient in the scene. Conversely,
listeners interpreted the passive-biased construction
with these roles exchanged.
2.3 Architecture
The Simple Recurrent Network is a type of neu-
ral network typically used to process temporal se-
quences of patterns such as words in a sentence.
A common approach is for the modeller to train
the network on prespecified targets, such as verbs
and their arguments, that represent what the net-
work is expected to produce upon completing a sen-
tence. Processing is incremental, with each new in-
put word interpreted in the context of the sentence
processed so far, represented by a copy of the pre-
vious hidden layer serving as additional input to the
current hidden layer. Because these types of asso-
ciationist models automatically develop correlations
among the sentence constituents they are trained
on, they will generally develop expectations about
the output even before processing is completed be-
cause sufficient information occurs early in the sen-
tence to warrant such predictions. Moreover, during
the course of processing a sentence these expecta-
tions can be overridden with subsequent input, often
abruptly revising an interpretation in a manner rem-
iniscent of how humans seem to process language.
Indeed, it is these characteristics of incremental pro-
cessing, the automatic development of expectations,
seamless integration of multiple sources of informa-
tion, and nonmonotonic revision that have endeared
neural network models to cognitive researchers.
In this study, the four experiments described
38
hidden layer
context layer
event layers
waeschtPrinzessinPirat PAT
input layer
waescht
Figure 3: Scene Integration. A simple conceptual rep-
resentation of the information in a scene, along with com-
pressed event information from depicted actions when present,
is fed into a standard SRN to model adaptive processing. The
links connecting the depicted characters to the hidden layer are
shared, as are the links connecting the event layers to the hidden
layer.
above have been modelled simultaneously using a
single network. The goal of modelling all experi-
mental results by a single architecture required en-
hancements to the SRN, the development and pre-
sentation of the training data, as well as the training
regime itself. These will be described in turn below.
In two of the experiments, only three characters
are depicted, representation of which can be propa-
gated directly to the network’s hidden layer. In the
other two experiments, the scene featured three char-
acters involved in two events (e.g., pirate-washes-
princess and princess-paints-fencer, as shown in
Figure 3). The middle character was involved in
both events, either as an agent or a patient (e.g.,
princess). Only one of the events, however, corre-
sponded to the spoken linguistic input.
The representation of this scene information and
its integration into the model’s processing was the
main modification to the SRN. Connections between
representations for the depicted characters and the
hidden layer were provided. Encoding of the de-
picted events, when present, required additional
links from the characters and depicted actions to
event layers, and links from these event layers to the
SRN’s hidden layer. The network developed repre-
sentations for the events in the event layers by com-
pressing the scene representations of the involved
characters and depicted actions through weights cor-
responding to the action, its agent and its patient for
each event. This event representation was kept sim-
ple and only provided conceptual input to the hidden
layer: who did what to whom was encoded for both
events, when depicted, but grammatical information
only came from the linguistic input. As the focus of
this study was on whether sentence processing could
adapt to information from the scene when present or
from stored knowledge, lower-level perceptual pro-
cesses such as attention were not modelled.
Neural networks will usually encode any correla-
tions in the data that help to minimize error. In order
to prevent the network from encoding regularities in
its weights regarding the position of the characters
and events given in the scene (such as, for example,
that the central character in the scene corresponds
to the first NP in the presented sentence) which are
not relevant to the role-assignment task, one set of
weights was used for all characters, and another set
of weights used for both events. This weight-sharing
ensured that the network had to access the informa-
tion encoded in the event layers, or determine the
relevant characters itself, thus improving generaliza-
tion. The representations for the characters and ac-
tions were the same for both input (scene and sen-
tence) and output.
The input assemblies were the scene represen-
tations and the current word from the input sen-
tence. The output assemblies were the verb, the
first and second nouns, and an assembly that indi-
cated whether the first noun was the agent or pa-
tient of the sentence (token PAT in Figure 3). Typ-
ically, agent and patient assemblies would be fixed
in a case-role representation without such a discrim-
inator, and the model required to learn to instantiate
them correctly (Miikkulainen, 1997). However, we
found that the model performed much better when
the task was recast as having to learn to isolate the
nouns in the order in which they are introduced, and
separately mark how those nouns relate to the verb.
The input and output assemblies had 100 units each,
the event layers contained 200 units each, and the
hidden and context layers consisted of 400 units.
39
2.4 Input Data, Training, and Experiments
We trained the network to correctly handle sentences
involving non-stereotypical events as well as stereo-
typical ones, both when visual context was present
and when it was absent. As over half a billion sen-
tence/scene combinations were possible for all of the
experiments, we adopted a grammar-based approach
to exhaustively generate sentences and scenes based
on the experimental materials while holding out the
actual materials to be used for testing. In order to
accurately model the first two experiments involv-
ing selectional restrictions on verbs, two additional
words were added to the lexicon for each charac-
ter selected by a verb. For example, in the sentence
Der Hase frisst gleich den Kohl, the nouns Hase1,
Hase2, Kohl1, and Kohl2 were used to develop train-
ing sentences. These were meant to represent, for
example, words such as “rabbit” and “jackrabbit” or
“carrot” and “lettuce” in the lexicon that have the
same distributional properties as the original words
“hare” and “cabbage”. With these extra tokens the
network could learn that Hase, frisst, and Kohl were
correlated without ever encountering all three words
in the same training sentence. The experiments in-
volving non-stereotypicality did not pose this con-
straint, so training sentences were simply generated
to avoid presenting experimental items.
Some standard simplifications to the words have
been made to facilitate modelling. For example,
multi-word adverbs such as fast immer were treated
as one word through hyphenation so that sentence
length within a given experimental set up is main-
tained. Nominal case markings such as -n in Hasen
were removed to avoid sparse data as these markings
are idiosyncratic, while the case markings on the de-
terminers are more informative overall. More impor-
tantly, morphemes such as the infinitive marker -en
and past participle ge- were removed, because, for
example, the verb forms malt, malen, and gemalt,
would all be treated as unrelated tokens, again con-
tributing unnecessarily to the problem with sparse
data. The result is that one verb form is used, and
to perform accurately, the network must rely on its
position in the sentence (either second or sentence-
final), as well as whether the word von occurs to
indicate a participial reading rather than infinitival.
All 326 words in the lexicon for the first four exper-
a0a1a0
a0a1a0
a2a1a2
a2a1a2
a3a1a3
a3a1a3
a4a1a4
a4a1a4
a5a1a5a1a5a1a5
a5a1a5a1a5a1a5
a5a1a5a1a5a1a5
a5a1a5a1a5a1a5
a5a1a5a1a5a1a5
a5a1a5a1a5a1a5
a5a1a5a1a5a1a5
a5a1a5a1a5a1a5
a5a1a5a1a5a1a5
a5a1a5a1a5a1a5
a6a1a6a1a6
a6a1a6a1a6
a6a1a6a1a6
a6a1a6a1a6
a6a1a6a1a6
a6a1a6a1a6
a6a1a6a1a6
a6a1a6a1a6
a6a1a6a1a6
a6a1a6a1a6
a7a1a7a1a7a1a7
a7a1a7a1a7a1a7
a7a1a7a1a7a1a7
a7a1a7a1a7a1a7
a7a1a7a1a7a1a7
a7a1a7a1a7a1a7
a7a1a7a1a7a1a7
a7a1a7a1a7a1a7
a7a1a7a1a7a1a7
a7a1a7a1a7a1a7
a8a1a8a1a8
a8a1a8a1a8
a8a1a8a1a8
a8a1a8a1a8
a8a1a8a1a8
a8a1a8a1a8
a8a1a8a1a8
a8a1a8a1a8
a8a1a8a1a8
a8a1a8a1a8
a9a1a9a1a9
a9a1a9a1a9
a9a1a9a1a9
a9a1a9a1a9
a9a1a9a1a9
a9a1a9a1a9
a9a1a9a1a9
a9a1a9a1a9
a9a1a9a1a9
a10a1a10a1a10
a10a1a10a1a10
a10a1a10a1a10
a10a1a10a1a10
a10a1a10a1a10
a10a1a10a1a10
a10a1a10a1a10
a10a1a10a1a10
a10a1a10a1a10
a11a1a11a1a11
a11a1a11a1a11
a11a1a11a1a11
a11a1a11a1a11
a11a1a11a1a11
a11a1a11a1a11
a11a1a11a1a11
a11a1a11a1a11
a12a1a12a1a12
a12a1a12a1a12
a12a1a12a1a12
a12a1a12a1a12
a12a1a12a1a12
a12a1a12a1a12
a12a1a12a1a12
a12a1a12a1a12
a13a1a13a1a13a1a13
a13a1a13a1a13a1a13
a13a1a13a1a13a1a13
a13a1a13a1a13a1a13
a13a1a13a1a13a1a13
a13a1a13a1a13a1a13
a13a1a13a1a13a1a13
a14a1a14a1a14a1a14
a14a1a14a1a14a1a14
a14a1a14a1a14a1a14
a14a1a14a1a14a1a14
a14a1a14a1a14a1a14
a14a1a14a1a14a1a14
a14a1a14a1a14a1a14
a15a1a15a1a15
a15a1a15a1a15
a15a1a15a1a15
a15a1a15a1a15
a15a1a15a1a15
a15a1a15a1a15
a15a1a15a1a15
a15a1a15a1a15
a15a1a15a1a15
a16a1a16a1a16
a16a1a16a1a16
a16a1a16a1a16
a16a1a16a1a16
a16a1a16a1a16
a16a1a16a1a16
a16a1a16a1a16
a16a1a16a1a16
a16a1a16a1a16
a17a1a17a1a17
a17a1a17a1a17
a17a1a17a1a17
a17a1a17a1a17
a17a1a17a1a17
a17a1a17a1a17
a17a1a17a1a17
a17a1a17a1a17
a17a1a17a1a17
a17a1a17a1a17
a18a1a18a1a18
a18a1a18a1a18
a18a1a18a1a18
a18a1a18a1a18
a18a1a18a1a18
a18a1a18a1a18
a18a1a18a1a18
a18a1a18a1a18
a18a1a18a1a18
a18a1a18a1a18
a19a1a19a1a19a1a19
a19a1a19a1a19a1a19
a19a1a19a1a19a1a19
a19a1a19a1a19a1a19
a19a1a19a1a19a1a19
a19a1a19a1a19a1a19
a19a1a19a1a19a1a19
a19a1a19a1a19a1a19
a19a1a19a1a19a1a19
a20a1a20a1a20a1a20
a20a1a20a1a20a1a20
a20a1a20a1a20a1a20
a20a1a20a1a20a1a20
a20a1a20a1a20a1a20
a20a1a20a1a20a1a20
a20a1a20a1a20a1a20
a20a1a20a1a20a1a20
a20a1a20a1a20a1a20
 85
 90
 95
 100
Exp 1 Exp 2 Exp 3 Exp 4
Percentage Correct
Adverb
NP2
Figure 4: Results. In each of the four experiments modelled,
anticipation of the upcoming argument at the adverb is nearly
as accurate as at sentence end. However, the network has some
difficulty with distinguishing stereotypical arguments.
iments were given random representations over the
vertices of a 100-dimensional hypercube, which re-
sulted in marked improvement over sampling from
within the hypercube (Noelle et al., 1997).
We trained the network by repeatedly presenting
the model with 1000 randomly generated sentences
from each experiment (constituting one epoch) and
testing every 100 epochs against the held-out test
materials for each of the four experiments. Scenes
were provided half of the time to provide an un-
biased approximation to linguistic experience. The
network was initialized with weights between -0.01
and 0.01. The learning rate was initially set to 0.05
and gradually reduced to 0.002 over the course of
15000 epochs. Ten splits were run on 1.6Ghz PCs
and took a little over two weeks to complete.
2.5 Results
Figure 4 reports the percentage of targets at the
network’s output layer that the model correctly
matches, both as measured at the adverb and at the
end of the sentence. The model clearly demonstrates
the qualitative behavior observed in all four experi-
ments in that it is able to access the information from
the encoded scene or stereotypicality and combine it
with the incrementally presented sentence to antici-
pate forthcoming arguments.
For the two experiments (1 and 2) using stereotyp-
ical information, the network achieved just over 96%
at sentence end, and anticipation accuracy was just
over 95% at the adverb. Analysis shows that the net-
work makes errors in token identification, confus-
ing words that are within the selectionally restricted
40
set, such as, for example, Kohl and Kohl2. Thus,
the model has not quite mastered the stereotypical
knowledge, particularly as it relates to the presence
of the scene.
For the other two experiments using non-
stereotypical characters and depicted events (exper-
iments 3 and 4), accuracy was 100% at the end of
the sentence. More importantly, the model achieved
over 98% early disambiguation on experiment 3,
where the sentences were simple, active SVO and
OVS. Early disambiguation on experiment 4 was
somewhat harder because the adverb is the disam-
biguating point in the sentence as opposed to the
verb in the other three experiments. As nonlinear
dynamical systems, neural networks sometimes re-
quire an extra step to settle after a decision point is
reached due to the attractor dynamics of the weights.
On closer inspection of the model’s behavior dur-
ing processing, it is apparent that the event layers
provide enough additional information beyond that
encoded in the weights between the characters and
the hidden layer that the model is able to make finer
discriminations in experiments 3 and 4, enhancing
its performance.
3 Simulation 2
The previous set of experiments examined how peo-
ple are able to use either stereotypical knowledge or
depicted information to anticipate forthcoming ar-
guments in a sentence. But how does the human
sentence processor handle these information sources
when both are present? Which takes precedence
when they conflict? The experiment modelled in this
section was designed to provide some insight into
these questions.
Scene vs Stored Knowledge. Based on the find-
ings from the four experiments in Simulation 1,
Knoeferle and Crocker (2004b) examined two is-
sues. First, it verified that stored knowledge about
non-depicted events and information from depicted,
but non-stereotypical, events each enable rapid the-
matic interpretation. An example scene showed a
wizard, a pilot, and a detective serving food (Fig-
ure 5). When people heard condition 1 (example
sentence 9), the case-marking on the first NP identi-
fied the pilot as a patient. Stereotypical knowledge
identified the wizard as the only relevant agent, as
Figure 5: Scene vs Stored Knowledge. Experimental results
show that people rely on depicted information over stereotypical
knowledge when both are present during sentence processing.
indicated by a higher proportion of anticipatory eye-
movements to the stereotypical agent (wizard) than
to the detective. In contrast, when people heard the
verb in condition 2 (sentence 10), it uniquely iden-
tified the detective as the only food-serving agent,
revealed by more inspections to the agent of the de-
picted event (detective) than to the wizard.
(9) Den Piloten verzaubert gleich der Zauberer.
The pilotacc jinxes shortly the wizardnom.
(10) Den Piloten verk¨ostigt gleich der Detektiv.
The pilotacc serves-food-to shortly the detectivenom.
Second, the study determined the relative impor-
tance of depicted events and verb-based thematic
role knowledge when the information sources were
in competition. In both conditions 3 & 4 (sentences
11 & 12), participants heard an utterance in which
the verb identified both a stereotypical (detective)
and a depicted agent (wizard). When faced with this
conflict, people preferred to rely on the immediate
event depictions over stereotypical knowledge, and
looked more often at the wizard, the agent in the de-
picted event, than at the other, stereotypical agent of
the spying-action (the detective).
(11) Den Piloten bespitzelt gleich der Detektiv.
The pilotacc spies-on shortly the detectivenom.
(12) Den Piloten bespitzelt gleich der Zauberer.
The pilotacc spies-on shortly the wizardnom.
3.1 Architecture, Data, Training, and Results
In simulation 1, we modelled experiments that de-
pended on stereotypicality or depicted events, but
not both. The experiment modelled in simulation
2, however, was specifically designed to investigate
41
how these two information sources interacted. Ac-
cordingly, the network needed to learn to use either
information from the scene or stereotypicality when
available, and, moreover, favor the scene when the
two sources conflicted, as observed in the empirical
results. Recall that the network is trained only on the
final interpretation of a sentence. Thus, capturing
the observed behavior required manipulation of the
frequencies of the four conditions described above
during training. In order to train the network to de-
velop stereotypical agents for verbs, the frequency
that a verb occurs with its stereotypical agent, such
as Detektiv and bespitzelt from example (11) above,
had to be greater than for a non-stereotypical agent.
However, the frequency should not be so great that
it overrode the influence from the scene.
The solution we adopted is motivated by a the-
ory of language acquisition that takes into account
the importance of early linguistic experience in a vi-
sual environment (see the General Discussion). We
found a small range of ratios of stereotypicality to
non-stereotypicality that permitted the network to
develop an early reliance on information from the
scene while it gradually learned the stereotypical as-
sociations. When the ratio was lower than 6:1, the
network developed too strong a reliance on stereo-
typicality, overriding information from the scene.
When the ratio was greater than 15:1, the scene
always took precedence when it was present, but
stereotypical knowledge was used when the scene
was not present. Within this range, however, the
network quickly learns to extract information from
the scene because the scene representation remains
static while a sentence is processed incrementally.
It is the stereotypical associations, predictably, that
take longer for the network to learn in rough propor-
tion to their ratio over non-stereotypical agents.
Figure 6 shows the effect this training regime had
over 6000 epochs on the ability of the network to ac-
curately anticipate the missing argument in each of
the four conditions described above when the ratio
of non-stereotypical to stereotypical sentences was
8:1. The network quickly learns to use the scene for
conditions 2-4 (examples 10-12), where the action in
the linguistic input stream is also depicted, allowing
the network to determine the relevant event and de-
duce the missing argument. (Because conditions 3
and 4 are the same up to the second NP, their curves
. 0.7
 0.75
 0.8
 0.85
 0.9
 0.95
 1
 0  1000 2000 3000 4000 5000 6000
Percentage Cond 1Cond 2
Cond 3Cond 4
Figure 6: Acquisition of Stereotypicality. Stereotypical
knowledge (condition 1) is acquired much more gradually than
information from the scene (conditions 2-4).
are, in fact, identical.) But condition 1 (sentence 9)
requires only stereotypical knowledge. The accu-
racy of condition 1 remains close to 75% (correctly
producing the verb, first NP, and role discriminator,
but not the second NP) until around epoch 1200 or
so and then gradually improves as the network learns
the appropriate stereotypical associations. The con-
dition 1 curve asymptotically approaches 100% over
the course of 10,000 epochs.
Results from several runs with different training
parameters (such as learning rate and stereotypical-
ity ratio) show that the network does indeed model
the observed experimental behavior. The best results
so far exceed 99% accuracy in correctly anticipating
the proper roles and 100% accuracy at sentence end.
As in simulation 1, the training corpus was gen-
erated by exhaustively combining participants and
actions for all experimental conditions while hold-
ing out all test sentences. However, we found that
we were able to use a larger learning rate, 0.1, than
the 0.05 used in the first simulation. The 130 words
in the lexicon were given random binary representa-
tions from the vertices of a 100-dimensional hyper-
cube as described before.
Analysis of the network after successful training
suggests why the training regime of holding the ratio
of stereotypical to non-stereotypical sentences con-
stant works. Early in training, before stereotypical-
ity has been encoded in the network’s weights, pat-
terns are developed in the hidden layer as each word
is processed that enable the network to accurately
decode the words in the output layer. Once the verb
is read in, its hidden layer pattern is available to pro-
42
duce the correct output representations for both the
verb itself and its stereotypical agent. Not surpris-
ingly, the network thus learns to associate the hidden
layer pattern for the verb with its stereotypical agent
pattern in the second NP output slot. The only con-
straint for the network is to ensure that the scene can
still override this stereotypicality when the depicted
event so dictates.
4 General Discussion and Future Work
Experiments in the visual worlds paradigm have
clearly reinforced the view of language comprehen-
sion as an active, incremental, highly integrative
process in which anticipation of upcoming argu-
ments plays a crucial role. Visual context not only
facilitates identification of likely referents in a sen-
tence, but helps establish relationships between ref-
erents and the roles they may fill. Research thus far
has shown that the human sentence processor seems
to have easy access to whatever information is avail-
able, whether it be syntactic, lexical, semantic, or vi-
sual, and that it can combine these sources to achieve
as complete an interpretation as is possible at any
given point in comprehending a sentence.
The modelling results reported in this paper are an
important step toward the goal of understanding how
the human sentence processor is able to accomplish
these feats. The SRN provides a natural framework
for this research because its operation is premised
on incremental and integrative processing. Trained
simply to produce a representation of the complete
interpretation of a sentence as each new word is pro-
cessed (on the view that people learn to process lan-
guage by reviewing what they hear), the model au-
tomatically develops anticipations for upcoming ar-
guments that allow it to demonstrate the early dis-
ambiguation behavior observed in the visual worlds
experiments modelled here.
The simple accuracy results belie the complex-
ity of the task in both simulations. In Simulation
1, the network has to demonstrate early disambigua-
tion when the scene is present, showing that it can
indeed access the proper role and filler from the
compressed representation of the event associated
with the first NP and verb processed in the linguistic
stream. This task is rendered more difficult because
the proper event must be extracted from the super-
imposition of the two events in the scene, which is
what is propagated into the model’s hidden layer. In
addition, it must also still be able to process all sen-
tences correctly when the scene is not present.
Simulation 2 is more difficult still. The experi-
ment shows that information from the scene takes
precedence when there is a conflict with stereotypi-
cal knowledge; otherwise, each source of knowledge
is used when it is available. In the training regime
used in this simulation, the dominance of the scene
is established early because it is much more fre-
quent than the more particular stereotypical knowl-
edge. As training progresses, stereotypical knowl-
edge is gradually learned because it is sufficiently
frequent for the network to capture the relevant as-
sociations. As the network weights gradually satu-
rate, it becomes more difficult to retune them. But
encoding stereotypical knowledge requires far fewer
weight adjustments, so the network is able to learn
that task later during training.
Knoeferle and Crocker (2004a,b) suggest that the
preferred reliance of the comprehension system on
the visual context over stored knowledge might best
be explained by appealing to a boot-strapping ac-
count of language acquisition such as that of Gleit-
man (1990). The development of a child’s world
knowledge occurs in a visual environment, which
accordingly plays a prominent role during language
acquisition. The fact that the child can draw on two
informational sources (utterance and scene) enables
it to infer information that it has not yet acquired
from what it already knows. This contextual devel-
opment may have shaped both our cognitive archi-
tecture (i.e., providing for rapid, seamless integra-
tion of scene and linguistic information), and com-
prehension mechanisms (e.g., people rapidly avail
themselves of information from the immediate scene
when the utterance identifies it).
Connectionist models such as the SRN have been
used to model aspects of cognitive development, in-
cluding the timing of emergent behaviors (Elman
et al., 1996), making them highly suitable for sim-
ulating developmental stages in child language ac-
quisition (e.g., first learning names of objects in the
immediate scene, and later proceeding to the acqui-
sition of stereotypical knowledge). If there are de-
velopmental reasons for the preferred reliance of lis-
teners on the immediate scene during language com-
43
prehension, then the finding that modelling that de-
velopment provides the most efficient (if not only)
way to naturally reproduce the observed experimen-
tal behavior promises to offer deeper insight into
how such knowledge is instilled in the brain.
Future research will focus on combining all of the
experiments in one model, and expand the range of
sentence types and fillers to which the network is
exposed. The architecture itself is being redesigned
to scale up to much more complex linguistic con-
structions and have greater coverage while retaining
the cognitively plausible behavior described in this
study (Mayberry and Crocker, 2004).
5 Conclusion
We have presented a neural network architecture that
successfully models the results of five recent exper-
iments designed to study the interaction of visual
context with sentence processing. The model shows
that it can adaptively use information from the vi-
sual scene such as depicted events, when present,
to anticipate roles and fillers as observed in each of
the experiments, as well as demonstrate traditional
incremental processing when context is absent. Fur-
thermore, more recent results show that training the
network in a visual environment, with stereotypical
knowledge gradually learned and reinforced, allows
the model to negotiate even conflicting information
sources.
6 Acknowledgements
This research was funded by SFB 378 project “AL-
PHA” to the first two authors and a PhD scholar-
ship to the last, all awarded by the German Research
Foundation (DFG).

References
Altmann, G. T. M. and Kamide, Y. (1999). Incre-
mental interpretation at verbs: Restricting the do-
main of subsequent reference. Cognition, 73:247–
264.
Elman, J. L. (1990). Finding structure in time. Cog-
nitive Science, 14:179–211.
Elman, J. L., Bates, E. A., Johnson, M. H.,
Karmiloff-Smith, A., Parisi, D., and Plunkett, K.
(1996). Rethinking Innateness: A Connectionist
Perspective on Development. MIT Press, Cam-
bridge, MA.
Gleitman, L. (1990). The structural sources of verb
meanings. Language Acquisition, 1:3–55.
Kamide, Y., Scheepers, C., and Altmann, G. T. M.
(2003). Integration of syntactic and seman-
tic information in predictive processing: Cross-
linguistic evidence from German and English.
Journal of Psycholinguistic Research, 32(1):37–
55.
Knoeferle, P. and Crocker, M. W. (2004a). The co-
ordinated processing of scene and utterance: ev-
idence from eye-tracking in depicted events. In
Proceedings of International Conference on Cog-
nitive Science, Allahabad, India.
Knoeferle, P. and Crocker, M. W. (2004b). Stored
knowledge versus depicted events: what guides
auditory sentence comprehension. In Proceedings
of the 26th Annual Conference of the Cognitive
Science Society. Mahawah, NJ: Erlbaum. 714–
719.
Knoeferle, P., Crocker, M. W., Scheepers, C., and
Pickering, M. J. (2005). The influence of the im-
mediate visual context on incremental thematic
role-assignment: evidence from eye-movements
in depicted events. Cognition, 95:95–127.
Mayberry, M. R. and Crocker, M. W. (2004). Gen-
erating semantic graphs through self-organization.
In Proceedings of the AAAI Symposium on Com-
positional Connectionism in Cognitive Science,
pages 40–49, Washington, D.C.
Miikkulainen, R. (1997). Natural language process-
ing with subsymbolic neural networks. In Browne,
A., editor, Neural Network Perspectives on Cogni-
tion and Adaptive Robotics, pages 120–139. Insti-
tute of Physics Publishing, Bristol, UK; Philadel-
phia, PA.
Noelle, D. C., Cottrell, G. W., and Wilms, F. (1997).
Extreme attraction: The benefits of corner attrac-
tors. Technical Report CS97-536, Department of
Computer Science and Engineering, UCSD, San
Diego, CA.
Tanenhaus, M. K., Spivey-Knowlton, M. J., Eber-
hard, K. M., and Sedivy, J. C. (1995). Integration
of visual and linguistic information in spoken lan-
guage comprehension. Science, 268:1632–1634.
