<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1228"> <Title>Selective Attention and the Acquisition of Spatial Semantics</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Connectionist Modelling and the L0 Project </SectionTitle> <Paragraph position="0"> Advances in brain sciences and information technology in recent decades have allowed the development of sophisticated models of cognitive processes at a Hogan, Diederich and Finn 235 Selective dttention and Acquisition of Spatial Semantics James M. Hogan, Joachim Diedcrich and Gerard D. Finn (1998) Selective Attention and the Acquisition of Spatial Semantics. In D.M.W. Powers (ed.) NeMLaP3/CoNLL98: New Methods in Language Processing and Computational Natural Language Learning, ACL, pp 235-244. number of levels of abstraction. Noting the domain-specific nature of much of this work, and the importance of integration of disparate cognitive machinery in the long-term development of the discipline, (Feldman et. al., 1990) proposed Lo as a &quot;touchstone \[task\] for cognitive science&quot;, requiring elements of visual perception, natural language modelling, and learning. As originally stated, the Lo task is to construct a computer system to perform Miniature Language Acquisition, without reliance upon &quot;forthcoming results in related domains&quot; to resuscitate an otherwise inadequate model: The system is given examples of pictures paired with true statements about those pictures in an arbitrary natural language.</Paragraph> <Paragraph position="1"> The system is to learn the relevant portion of the language well enough so that given a novel sentence of that language, it can determine whether or not the sentence is true of the accompanying picture.</Paragraph> <Paragraph position="2"> The system is further constrained by the substantial variations known to exist across natural languages in their characterisation of space - eliminating ad hoc computational mechanisms - and by the assumption that learning must simulate childhood language acquisition in the exclusion of explicit negative evidence (see for example (Chomsky, 1965)). Thus only positive instances of a given concept may be presented during training, but the system may receive negative examples during normal operation.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 A Semantic Sub-Task </SectionTitle> <Paragraph position="0"> The L0 sub-task examined by (Regier, 1992) requires that the model system acquire &quot;perceptually grounded semantics of natural language spatial terms&quot;. Each lexeme describes a locative relationship between a special (potentially mobile) object known as the trajector (TR) and a static reference object known as the landmark (LM) (Langacker, 1987). Figure 1 shows a positive example for the English lexeme 'above'. In essence, spatial semantics define a partitioning of the set of object pictures into classes prescribed by the underlying natural language. The task of the model system is then to learn this classification from positive examples of each category, forming a recognition system for each class of pictures 1. In English, these labels might include the 1However, examples may belong to a number of categories, and some gradation of class membership is desirable as some scenes are better, more prototypical exemplars of a given concept than others. 
<Paragraph position="3"> System input is provided in the form of a two-dimensional bitmap (static concepts) or sequence of bitmaps (hereafter a "movie", for dynamic concepts), usually showing only the 'line-drawn' LM and TR in a position exemplifying the concept, although Regier does also consider more complicated phenomena such as deixis². The task is thus simplified so as to limit issues of object detection (through avoidance of feature-laden scene backgrounds and object interiors) and confusion due to distractors, and to disregard the role of luminance and colour.</Paragraph> <Paragraph position="4"> Nevertheless, computational approaches are greatly constrained by many years of research in a number of disciplines, rendering the task of feature extraction and encoding non-trivial. While a cognitive model need not replicate all aspects of the underlying neural substrate, it gains in plausibility if it supports classifications based upon processing in cortical areas known to be active during performance of the given task. Thus, some functional replication of neural pathways - ostensibly at the level of systems neuroscience (Churchland and Sejnowski, 1992) - becomes an essential aspect of architectural design, and this is more readily accomplished through a top-down approach.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 The Regier Model </SectionTitle> <Paragraph position="0"> (Regier, 1992) implemented highly structured connectionist systems for both the static and dynamic concept classes discussed above - the dynamic system incorporating the single-frame processing capabilities of the static system. Concepts were represented in terms of directional³ and non-directional⁴ features computed from the image, system pre-processing providing the output network with real-valued encodings for each feature value. In contrast to the present model, objects are tagged as LM or TR tokens as part of the input representation, and the image is partitioned into separate TR and LM bitmaps as part of pre-processing.</Paragraph> <Paragraph position="1"> Computationally, the Regier system may be viewed within the framework of "partially structured connectionism" (Feldman et al., 1988), in which systems-level architectural design is coupled with unstructured local networks which may be trained to perform (initially unspecified) functions so as to realise an overall system task - although this description understates the specificity of some model subsystems⁵.
³ In particular, the orientation of a line connecting points of closest approach, and of that connecting the centres of mass.
⁴ For example, surface contact or inclusion.
⁵ In the original L0 paper, (Feldman et al., 1990) noted the difficulty in balancing the facilitation of learning provided by "innate structures" (in computational modelling, a top-down approach) against the potential generality of relatively unstructured networks. Notwithstanding the apparent structural sophistication of the Regier model - perhaps motivated in part by difficulties in parameter adjustment with limited training sets - the choice of feature extraction machinery was in this case sufficiently general to allow lexeme acquisition across a variety of natural languages.</Paragraph> </Section>
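As an illustration of the directional and non-directional features just mentioned (see footnotes 3 and 4), the sketch below computes the centre-of-mass orientation and a crude overlap test from separate TR and LM bitmaps. It is not Regier's code; the bitmap encoding and helper names are assumed for the example.

```python
import numpy as np

def centre_of_mass(bitmap):
    """Mean (row, col) of the set pixels in a binary bitmap."""
    return np.argwhere(bitmap).mean(axis=0)

def com_orientation(tr_bitmap, lm_bitmap):
    """Orientation (radians) of the line from the LM centre of mass to the
    TR centre of mass, measured anticlockwise from 'rightward'."""
    tr_r, tr_c = centre_of_mass(tr_bitmap)
    lm_r, lm_c = centre_of_mass(lm_bitmap)
    # Image rows grow downward, so negate the row difference for a
    # conventional y-up angle.
    return np.arctan2(-(tr_r - lm_r), tr_c - lm_c)

def overlaps(tr_bitmap, lm_bitmap):
    """Crude non-directional feature: do TR and LM share any pixels?"""
    return bool(np.logical_and(tr_bitmap, lm_bitmap).any())

# Toy 8x8 scene: a 'circle' above a 'square', each reduced to a pixel blob.
tr = np.zeros((8, 8), dtype=bool); tr[1:3, 3:5] = True   # trajector (upper)
lm = np.zeros((8, 8), dtype=bool); lm[5:7, 3:5] = True   # landmark (lower)
print(np.degrees(com_orientation(tr, lm)))   # ~90 degrees: TR above LM
print(overlaps(tr, lm))                      # False: no surface contact
```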
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Discussion of the Regier Model </SectionTitle> <Paragraph position="0"> It is well accepted that perceptual representations may rely upon independent encodings of object features and properties in distinct anatomical areas, and that some mechanism is then required to associate or bind the representations together to facilitate processing of a particular object instantiation (Treisman, 1996). This observation is best illustrated by the separation of the object recognition (variously the 'what'/'object'/'ventral'/'occipito-temporal') and location ('where'/'spatial'/'dorsal'/'occipito-parietal') pathways of the visual system.</Paragraph> <Paragraph position="1"> While Regier carefully positioned his model clear of any controversy over correspondence with biological structures, its architecture must ultimately be viewed as an abstraction of the 'where' (dorsal) pathway, the need for object recognition being reduced through explicit tagging of the input data.</Paragraph> <Paragraph position="2"> Although spatial relations are implicitly determined by the position of the objects in an example image, equally valid but semantically distinct (perhaps antonymic) characterisations of the scene may be made depending upon the selection of trajector and landmark. Figure 1, for example, may be regarded as a prototypical example of both above ("Circle above square"; TR=Circle; LM=Square) and below ("Square below circle"; TR=Square; LM=Circle).</Paragraph> <Paragraph position="3"> Identification of TR and LM is thus critical in the selection of the appropriate lexeme, and correct tagging appears to require both association of an object name with an internal representation sufficient to facilitate visual search, and a language-specific comprehension of the syntactic relationship between the TR, LM, and lexeme⁶. It is our contention that childhood acquisition of spatial semantics is dependent upon sufficient facility in the native language to perform this object-tagging, through parsing of spoken language fragments associated with the image.</Paragraph> <Paragraph position="4"> It is thought (Crystal, 1995), (Khanji and Weist, 1996) that acquisition of elementary spatial lexemes takes place soon after the "naming explosion" of the second year of life (Woodward et al., 1994), and prior to the development of sophisticated internal models of space (such as those allowing scene rotation and manipulation). Studies of spatial and temporal lexeme acquisition among young children native in various European and Middle Eastern languages (Johnston and Slobin, 1979), (Weist, 1991), (Khanji and Weist, 1996) indicate that subject groups of mean age as low as 30 months may correctly⁷ associate pictures with spoken sentences such as "The parrot is in/on the cage".
⁶ Language-specificity to this extent does not violate the requirements of the L0 task. While the Regier system was successfully applied to a number of natural languages, acquisition for a given language was performed independently of other training, utilising an output network encoding (and subtle adjustments to internal feature detector parameters) specific to that language. Syntactic variations are a similar limitation of system generality.
⁷ Correctness is here a matter of statistically significant deviation from random performance, there being typically two alternative language fragments offered with a pair of images.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Relationship to Feature Binding </SectionTitle> <Paragraph position="0"> Issues of visual search and object recognition must necessarily assume greater importance with increases in the complexity of the scene - with consequent difficulty in tagging of TR and LM - but some linkage must be provided between object identification and location if spatially based semantics are to be encoded and processed. (Treisman, 1996) notes that object instantiation requires construction from more elementary features (such as shape and colour) and maintenance of the resulting entity through displacement or continuous motion. While the exact neural mechanisms which mediate binding are unknown, the most likely candidates are thought to involve temporary cell assemblies selected by focussed attention - with activations corresponding to the attended object remaining undiminished and those away from this region being suppressed.</Paragraph> <Paragraph position="1"> Propagation of these activations through a global location map provides the common reference point needed to link disparate representations (Treisman and Gelade, 1980).</Paragraph> <Paragraph position="2"> Significantly, the cognitive framework discussed above was introduced to explain the degradation of visual search performance within cluttered domains with complex feature conjunctions, and is closely aligned with the neural mechanisms considered in the following sections.
A model based upon selective attention thus has the advantage of a unified approach to the disparate processing requirements of the problem, while providing a sound basis for extensions to more complicated input scenes and linguistic phenomena.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Selective Visual Attention </SectionTitle> <Paragraph position="0"> It is well known that primate visual cortex receives information from the optic nerve at a rate well above the region's storage and processing capacities.</Paragraph> <Paragraph position="1"> Some mechanism of selective attention, whereby a small but important subset of the visual field may be given detailed processing, is therefore necessary.</Paragraph> <Paragraph position="2"> Visual processing is typically decoupled into two regimes (Niebur and Koch, 1997): * A pre-attentive phase, during which parallel extraction of elementary features is performed * An attentive phase, during which the more salient or conspicuous stimuli within the field are processed in sequence, input from other stimuli being suppressed during this processing.</Paragraph> <Paragraph position="3"> Attentional processing thus requires some selection mechanism based upon the elementary features extracted during the pre-attentive phase - with possible external input from some other neural region or sensory domain⁸. However, the selection mechanism need not be spatially sequential, and two types of covert⁹ visual attention are commonly distinguished, governed largely by the nature of the (perhaps automatic) search task being undertaken by the visual system. Focal attention (Niebur and Koch, 1995), (Niebur and Koch, 1997) is a sequential search through a series of progressively less salient locations, selection being driven primarily from below - saliency being determined from the contributions of elementary features extracted during the pre-attentive phase. In contrast, dispersed or feature-based attention (Usher and Niebur, 1996), (Niebur and Koch, 1997) is spatially parallel, but regarded as sequential within some feature space - the selection relying upon some "top-down" signal to highlight a particular conjunction of features.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Neural Gating - The Saliency Map </SectionTitle> <Paragraph position="0"> Regardless of whether saliency is an emergent property of the input scene or imposed (perhaps consciously) from some other cortical region, each model requires that selection and suppression of stimuli be realisable in a neurally plausible structure. Most location-based attentional models are at present based upon the saliency map, introduced by (Koch and Ullman, 1985). While no localised neural implementation of this structure has been discovered, there is strong evidence for the existence of a mechanism based upon several elementary features extracted from the image (Niebur and Koch, 1997), and unit activations within the map are computed from a weighted sum of feature map outputs - giving a measure of "conspicuity" within each unit's receptive field¹⁰.</Paragraph> <Paragraph position="1"> (Niebur and Koch, 1995) employ a total of eight input maps based upon orientation, intensity, chromatic components and temporal change, along with provision for "external" (i.e. top-down) inputs to account for cueing effects.</Paragraph>
⁸ (Koch and Ullman, 1985) suggest that attentional control may be located as peripherally as the LGN, relying upon back-projections from cortical feature maps.
⁹ High-resolution visual processing is dependent upon alignment of fovea and stimulus, normally achieved in primates through rapid eye and head movements in a process known as overt attention (Niebur and Koch, 1997). Neither mechanism is considered in this brief review, and our model assumes that covert attentional shifts are sufficient to capture the phenomena of interest - a simplification which must break down for wide-field moving trajectors but is otherwise plausible.
¹⁰ Note the similarity to the master feature map of Feature Integration Theory (Treisman and Gelade, 1980).
<Paragraph position="2"> The most salient feature in the input field is then computed by means of a winner-take-all network over the map, selecting the unit with the highest activation and suppressing output from the remaining units through recurrent connections. In addition, the winning unit is itself inhibited over time, allowing attention to shift to a salient (but previously unattended) stimulus even if the scene remains unchanged. This inhibition serves also to prevent the immediate return of attention to a previously attended site, in accordance with psychophysical evidence (Posner and Cohen, 1984), (Tipper et al., 1991).</Paragraph> </Section>
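The saliency computation, winner-take-all selection and inhibition of return described in this section can be summarised in a short sketch. This is an illustrative toy rather than the Niebur and Koch implementation: the feature maps are random arrays, and the uniform weights, suppression radius and number of shifts are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Eight toy feature maps (orientation, intensity, colour, temporal change...)
feature_maps = rng.random((8, 16, 16))
weights = np.ones(8) / 8.0               # uniform conspicuity weights

# Saliency: weighted sum of feature map outputs at each location.
saliency = np.tensordot(weights, feature_maps, axes=1)

def attend(saliency, shifts=3, suppress_radius=2):
    """Sequentially select the most salient locations (winner-take-all),
    inhibiting each winner so attention can shift and not return."""
    s = saliency.copy()
    winners = []
    for _ in range(shifts):
        r, c = np.unravel_index(np.argmax(s), s.shape)   # WTA winner
        winners.append((r, c))
        # Inhibition of return: suppress a neighbourhood of the winner.
        s[max(r - suppress_radius, 0):r + suppress_radius + 1,
          max(c - suppress_radius, 0):c + suppress_radius + 1] = 0.0
    return winners

print(attend(saliency))   # e.g. three successive foci of attention
```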
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Neural Gating - Object Representation </SectionTitle> <Paragraph position="0"> (Fujita et al., 1992) found through cell recordings that neurons within infero-temporal (IT) cortex are organised into columns with optimal selectivity toward abstractions of known objects (simple geometric shapes, differential shading etc.), with activation significantly greater when presented with the abstracted or minimalist image rather than a detailed photograph of a similar object. On anatomical (i.e. resource limitation) grounds, these findings suggest that objects may be represented through a combination of no more than 1000 of these elemental pictures, with adaptation of representations occurring as necessary¹¹.
¹¹ Note that these columns have low spatial selectivity - existing well along the visual processing hierarchy - and are sensitive to such stimuli regardless of their position in the field.</Paragraph> <Paragraph position="1"> Usher and Niebur's model of feature-based attention (Usher and Niebur, 1996) receives input from the entire visual field through such activated IT cortex cell assemblies, with the search task guided by weak "top-down" activation of the favoured feature class from a similar representation in working memory (here taken to be pre-frontal cortex). While an explicit saliency map is not employed, the attended stimulus is again determined through competitive selection among the input representations (here object cell assemblies). In a cluttered field, top-down activation may provide a winning advantage to the favoured object.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Modulation at the Focus of Attention </SectionTitle> <Paragraph position="0"> Once the most salient stimulus has been selected from among its competitors, some mechanism must be employed to facilitate passage of its associated input data to "higher" cortical centres while suppressing passage of competing input. In the Niebur and Koch model (Niebur and Koch, 1995), a modulating signal from the saliency map is propagated via recurrent connections back to the region of primary visual cortex (V1) associated with the winning unit. Enhanced activation is thus re-propagated along the visual pathways, giving this input stream substantial advantages in any competitive selection processes subsequently encountered¹². Widespread propagation of an enhancement signal of this kind to features associated with an object at the most salient location in the visual field is thought to underpin feature binding (Treisman, 1996).
¹² Enhancement of activation is accomplished via temporal tagging - modulation of the spike train through a time-varying Poisson process (Niebur and Koch, 1995).</Paragraph> </Section> </Section>
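The temporal tagging of footnote 12 can be illustrated with a toy spike-train generator: attended input keeps its mean firing rate but is marked by a rhythmic modulation of the time-varying Poisson rate. The rates, time step and 40 Hz modulation below are assumptions for the example, not the published parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

def spike_train(rate_hz, duration_s, dt=0.001, modulation=None):
    """Sample a (possibly time-varying) Poisson spike train.

    `modulation` maps time (s) to a multiplicative rate factor; for an
    attended stimulus it might rhythmically tag the firing rate.
    """
    t = np.arange(0.0, duration_s, dt)
    rate = np.full_like(t, rate_hz)
    if modulation is not None:
        rate = rate * modulation(t)
    return t[rng.random(t.shape) < rate * dt]   # spike times in seconds

unattended = spike_train(20.0, 1.0)
# Attended stimulus: same mean rate, but tagged with a 40 Hz modulation.
attended = spike_train(20.0, 1.0,
                       modulation=lambda t: 1 + np.sin(2 * np.pi * 40 * t))
print(len(unattended), len(attended))   # similar counts, distinct timing
```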
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Model Architecture </SectionTitle> <Paragraph position="0"> This section introduces a connectionist model for spatial lexeme acquisition based upon the attentional mechanisms discussed above. Only the model for static concepts is presented here, although few changes to the gross architecture are necessary to accommodate the dynamic case. As in the Regier model, an unstructured output or decision network encodes the lexeme representations, receiving input from neurally inspired processing modules - although here the object recognition pathway is explicitly considered. The following sections outline the gross architecture and functionality of the model, developing each substructure in turn before discussion of the output network. Implementation and representation issues are examined in section 5.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 A Conceptual Model for the Static Case </SectionTitle> <Paragraph position="0"> Each static scene may be characterised as a movie consisting of repeated presentations of the same frame - attention initially focussed upon one object (for example the TR) and passing during movie presentation to the other (the LM). Network learning depends upon presentation of frames exemplifying each of these phases, and object tagging (identification of objects as respectively TR and LM) relies upon "visual search" initiated by parsing of the language fragment, and subsequent binding of object feature and location information. The approach is solidly grounded in the Feature Integration Theory of Treisman (Treisman and Gelade, 1980), with perceptual binding mediated through selective attention.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 The Recognition Pathway </SectionTitle> <Paragraph position="0"> Processing corresponding to the early visual system is not explicitly modelled, and system input is provided by three unit banks, representing language input, object recognition and object location. Object representation is based upon the IT cortex assemblies discussed in section 3.2, with the simplifying assumption that input scenes contain only objects closely identifiable with a single iconic image - the system being restricted to a discrete set of object types whose presence is indicated by the activation of a single input unit¹³.</Paragraph> <Paragraph position="1"> Language input is similarly reduced to a bank of object units, on the basis that apprehension of the object description (for example a simple noun such as circle) is sufficient to activate a representation of the object, already available in memory as a result of exposure to the image. In computational terms, the visual object has been tagged as a CIRCLE token, and the iconic CIRCLE representation activated, although the reality is far less neatly partitioned. This representation provides top-down activation in much the same manner as the working memory module of the Usher and Niebur model (Usher and Niebur, 1996), the mechanisms together realising object tagging through an abstraction of feature-based visual search.</Paragraph> <Paragraph position="2"> The relationship between language and object input is shown at the right of figure 2, tagging being represented by a conjunction between the language and object units within the binding network - the winning conjunction being selected through a Winner-Take-All (WTA) network (Feldman, 1982), and unwanted, weaker conjunctions being discarded.</Paragraph> <Paragraph position="3"> Such selection and suppression mechanisms readily allow generalisation of the tagging system to more cluttered scenes or sophisticated linguistic phenomena, particularly as tagging is performed over time - greatly reducing problems of cross-talk.</Paragraph> <Paragraph position="4"> The robustness of the cell assembly representation is here captured through multiple random projections from each unit to the binding network, ensuring with high probability that at least one connection with a particular binding unit is realised¹⁴. The function of the binding subsystem is illustrated below by the example of figure 1 and the language fragment "circle above square". Input from the object assemblies remains constant throughout the period, and for the sake of brevity is suppressed. For clarity, the number of scene frames is limited to four, with change in the language input after the second frame.
¹³ Extensions to more complicated objects require representation in terms of a weighted combination of these iconic 'letters'. The binding mechanisms discussed here are in principle sufficient to handle such extensions, but pre-processing would necessarily be complex.</Paragraph> </Section>
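The behaviour described above can be sketched as follows: each frame forms conjunctions of the active language and object units, and a winner-take-all step retains only the strongest, tagging the circle (TR) in the first two frames and the square (LM) thereafter. The unit names and connection strengths are hypothetical stand-ins for the binding network.

```python
# Toy binding network for the scene of figure 1 ("circle above square").
# Connection strengths are invented; an associated word-object pair projects
# more strongly onto its shared binding unit than a mismatched pair.
OBJECT_UNITS = {"CIRCLE": 1.0, "SQUARE": 1.0}    # both objects visible
PROJECTION = {("circle", "CIRCLE"): 1.0, ("circle", "SQUARE"): 0.2,
              ("square", "SQUARE"): 1.0, ("square", "CIRCLE"): 0.2}

def bind(language_input):
    """Winner-take-all over language-object conjunctions for one frame."""
    activation = {
        (word, obj): w_act * o_act * PROJECTION[(word, obj)]
        for word, w_act in language_input.items()
        for obj, o_act in OBJECT_UNITS.items()
    }
    # Weaker conjunctions are discarded; only the winner propagates.
    return max(activation, key=activation.get)

# Four frames; the language input changes after the second frame, tagging
# the circle (TR) first and the square (LM) second.
frames = [{"circle": 1.0}] * 2 + [{"square": 1.0}] * 2
for i, lang in enumerate(frames, 1):
    print(f"frame {i}: winning conjunction = {bind(lang)}")
```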
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Integration with the Location Pathway </SectionTitle> <Paragraph position="0"> For object tagging to be useful in the present context requires some integration of the feature-based and location-based models of selective attention considered in section 3. The mechanisms of the previous section are strongly reliant upon feature-based attention (Usher and Niebur, 1996), and do not require an explicit saliency map.</Paragraph> <Paragraph position="1"> Recall that location-based attentional models (Niebur and Koch, 1995) construct saliency as a weighted sum of several constituent feature maps which, while representing anatomically distinct areas, provide inherent location binding. The model also provides for external input to this map to account for cueing - perhaps mediated through representations in working memory - but again the input is location bound.</Paragraph> <Paragraph position="2"> The current work preserves the global saliency map of (Niebur and Koch, 1995), but introduces feature-based input to the map through the external channel of the previous paragraph, as though the primitive object cell assemblies of IT cortex were merely another feature map contributing to overall saliency. Both classes of model (colloquially 'where' and 'what') rely on top-down modulation of activation in order to implement the selection of the attended region. In the former case, modulation takes place through recurrent connections to primary visual cortex, and 'where'-to-'what' information transfer may take place through binding at the focus of attention - essentially through lock-step re-propagation of the modulation along both pathways - although this is not required for the present task.</Paragraph> <Paragraph position="3"> 'What'-to-'where' transfer in the current model is based upon an extension of the feature-based model of (Usher and Niebur, 1996), with propagation of top-down modulation from the IT assemblies to striate cortex, and re-propagation as for the 'where'-to-'what' linkages¹⁵. This mechanism is abstracted in the current model so that object-based input is effectively represented in another feature map, although some delay to account for the traversal of the pathway may be desirable in more sophisticated extensions. However, the approach effectively eliminates the need for direct input from the object assemblies to the decision network, as binding has been extended to the saliency map. As before, we may characterise this interaction by examining the bindings realised, as in the sketch below. The input sequence is as before, but suppressed for clarity, and location input is restricted to representative vectors x1 (the square) and x2 (the circle).
¹⁵ Only limited success has been achieved to date in elucidating mechanisms of communication between the two pathways, although binding of representations necessarily demands it. The model proposed here is attractive and plausible, but remains to be established experimentally (Niebur, 1997).</Paragraph> </Section>
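A minimal sketch of this integration, under the same toy assumptions as the earlier saliency example: the tagged object's assembly activation enters the saliency computation as one more feature map via the external channel, so the winner-take-all over the combined map lands on the tagged object's location. The gain and map contents are invented for illustration.

```python
import numpy as np

def combined_saliency(bottom_up_maps, object_map, top_down_gain=2.0):
    """Treat top-down object input as just another feature map.

    `bottom_up_maps`: stack of bottom-up feature maps, shape (k, H, W);
    `object_map`: activation projected from the tagged IT assembly, (H, W).
    """
    return bottom_up_maps.mean(axis=0) + top_down_gain * object_map

rng = np.random.default_rng(2)
bottom_up = rng.random((8, 16, 16))
obj = np.zeros((16, 16))
obj[4, 7] = 1.0   # x2: location of the currently tagged object (the circle)

s = combined_saliency(bottom_up, obj)
print(np.unravel_index(np.argmax(s), s.shape))   # attention lands at (4, 7)
```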
<Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Lexeme Binding </SectionTitle> <Paragraph position="0"> As in the Regier model, lexeme acquisition is ultimately accomplished through a sparse-coded representation at an unstructured (here randomly connected) output network. In its purest form, the model exerts very tight control over the information which is passed to this decision network - object and location information being effectively gated by the saliency map. This decoupling of the problem both simplifies and complicates the issue: binding at the output network requires a lower-degree conjunction, but the lexeme is now in principle a temporal rather than spatial conjunction - necessitating a recurrent output network.</Paragraph> <Paragraph position="1"> Bottom-up saliency is of relatively little consequence in the static case, as the conscious selection implied by the object tagging mechanism controls the focus of attention, and these considerations cannot be over-ridden by an unchanging input scene - although decay of the most salient location helps facilitate the attention shift.</Paragraph> <Paragraph position="2"> Figure 2 shows the gross architecture in its entirety. Lexemes are represented by individual output units of the decision network, gated input being provided to this network from the saliency map, and language input (i.e. the encoding of the lexeme itself) implicit in the learning mechanism. At this point, the network must represent a binding of the form: above<TR(x1), LM(x2)>.</Paragraph> <Paragraph position="3"> Successful acquisition of such bindings is dependent upon the structure of the saliency map and its relationship to the output network, and these issues are considered in detail in the following sections.</Paragraph> </Section> </Section> </Paper>