<?xml version="1.0" standalone="yes"?> <Paper uid="W97-1412"> <Title>A Syndetic Approach to Referring Phenomena in Multimodal Interaction</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Syndetic Modelling </SectionTitle> <Paragraph position="0"> The word syndesis comes from the ancient greek (aw = together and 5~o~ = to tie), meaning to bring, to connect, to compose together. It conveys the idea of being able to reason about complex systems as a whole while keeping the capability of isolating and reasoning about their basic components at the same time.</Paragraph> <Paragraph position="1"> In our case, the syndetic model of an interactive system extends the formal model of its interface with the model of the cognitive resources needed to interact with the devices. Earlier work in this direction has been using state based notations and was aiming at the exploration of this field at a high level of abstraction (Barker and Buxton, 1987; Chan et al., 1984). In other approaches theoretical models originating from psychology have been used in an indirect way, see for example (Card et al., 1990; Fitts, 1954).</Paragraph> <Paragraph position="2"> We deviates from those early approaches by using cognitive models in a direct way within the design and specification process and find our justification for such an approach in that the factors that affect 8~ G.P. Faconti and M. Massink usability depend on psychological and social properties of cognition and work, rather than on abstract mathematical models of programming semantics.</Paragraph> <Paragraph position="3"> Although in principle any cognitive theory might be adopted, we address one particular cognitive model, Phil Barnard's Interacting Cognitive Sub-systems or shortly ICS (Barnard and May, 1993; Barnard and May, 1994). We formally model aspects of this theory in such a way that it can be combined with a traditional system specification. The formal model of the system provides few insights into the usability of its interface as well as the formal model of the user derived from some psychological theory supports general claims about the user's cognitive processes but not about the effective use of cognitive resources in a given context. By combining both of them in a syndetic model we can reason about how cognitive resources are mapped onto the functionality of the system.</Paragraph> <Paragraph position="4"> Within this approach, we consider an abstract view of the flow of information between devices, users and system. To facilitate precise description and modelling at this level, we make use of a specification notation in which the various components (device, system and user) are modelled as interactors. The concept of an interactor has been de-scribed in detail elsewhere, for example (Duke and Harrison, 1993; Faconti and Paterno, 1990). Briefly, an interactor is an object-like entity with an internal state, a presentation through which parts of the state (called percepts) can be perceived by a user, and actions - either user or system initiated - that bring about changes to the state. Interactors have been described using a number of formal notations including Z, LOTOS and MAL (Modal Action Logic), and it is the last of these that is used here. Briefly, MAL (Ryan et al., 1991) is a typed first-order logic that extends the predicate logic with an additional operator. 
For any action 'A' and predicate 'P', the predicate '[A] P' means that after the action A is performed, P must hold.</Paragraph>
<Paragraph position="5"> Interactors can describe the logical and physical components of an interactive system, but by themselves give little direct insight into how a user might or might not be able to use the system. This is a problem, as many of the developments in interactive systems that can benefit from the use of abstract models also depend critically on human abilities to process information. Syndetic models (Duke, 1995; Duke et al., 1995; Faconti and Duke, 1996) address this problem by expressing the behaviour of computing and cognitive systems within a common framework that supports reasoning about the conjoint system. Clearly, the 'computer' component of a syndetic model is determined by the system being represented, but for the cognitive side there is a range of models to choose from, each emphasising different aspects of human information processing. The approach that we have adopted for syndetic modelling is called Interacting Cognitive Subsystems, or ICS, and is summarised in Section 3. Importantly, ICS operates in terms of resources and information flow at a level of abstraction that is commensurate with that used to describe interactors.</Paragraph> </Section>
<Section position="3" start_page="0" end_page="7" type="metho"> <SectionTitle> 3 Interacting Cognitive Subsystems (ICS) </SectionTitle>
<Paragraph position="0"> ICS is a comprehensive model of human information processing that describes cognition in terms of a collection of subsystems that operate on specific mental codes or representations. Barnard and May identify two major aspects of ICS: a theory of representation and a theory of information flow. Interestingly, the two kinds of theory can be related, respectively, to abstract data types and state-based specifications, and to process algebraic and data flow approaches in computer science. The work on syndetic modelling has so far concentrated only on capturing the theory of information flow and on exploring problems by reasoning about it. The representation of the mental codes has not yet been explored from a formal perspective. Recently, the ICS project at the MRC Applied Psychology Unit in Cambridge and at the Departments of Psychology of the Universities of Sheffield and Copenhagen has developed a systematic treatment of visual structures (May et al., 1995; May et al., 1997) that will be part of our future research.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Information flow in ICS </SectionTitle>
<Paragraph position="0"> ICS represents human information processing as a highly parallel organization with a modular structure composed of nine subsystems. Although specialised to deal with specific codes, all subsystems have a common architecture, shown in Figure 1.</Paragraph>
<Paragraph position="1"> [Figure 1: the common architecture of a subsystem - an input array, a local image record ('copy' from/to the store), and transformation processes ('transform C to X', 'transform C to Y').] </Paragraph>
<Paragraph position="3"> These subsystems can perform two kinds of operation upon the representations that they receive at an input array. They can copy the representation directly into the image record, which acts as a memory local to each subsystem, and they can transform the information into another mental representation and pass it through a data network to other subsystems. The transformation processes within each subsystem are independent and can work in parallel.</Paragraph>
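As an illustration only (not part of the ICS formulation), the two operations just described can be sketched as follows; the Python encoding and all names in it are our own assumptions.

```python
# Minimal sketch of the common subsystem architecture of Figure 1: an input
# array, a local image record, and transformation processes. All names here
# (Representation, Subsystem, ...) are illustrative assumptions.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class Representation:
    code: str       # mental code the representation is expressed in, e.g. 'vis'
    content: str    # abstract stand-in for the structured content

@dataclass
class Subsystem:
    name: str                                          # e.g. 'VIS', 'OBJ', 'PROP'
    transforms: dict = field(default_factory=dict)     # target code -> function
    image_record: list = field(default_factory=list)   # memory local to the subsystem

    def receive(self, rep):
        """Copy the incoming representation into the image record and, in
        parallel, run every transformation process over it."""
        self.image_record.append(rep)                       # the 'copy' operation
        return [f(rep) for f in self.transforms.values()]   # the 'transform' operations

# Example: a visual subsystem that can only re-code into object representations.
vis = Subsystem('VIS', transforms={
    'obj': lambda r: Representation('obj', f"objects({r.content})"),
})
outputs = vis.receive(Representation('vis', 'hue/contour data'))
print(outputs)   # one output per transformation process
```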
<Paragraph position="4"> The representations that can be output by a subsystem are limited by the informational content of the representations that it operates upon; that is, a subsystem cannot produce output in every representation. Moreover, any one transformation process can only operate upon a single coherent data stream at one time. That is, it can only operate upon one representation, and can only produce one output representation.</Paragraph>
<Paragraph position="5"> If the incoming data is incomplete, a subsystem can augment it by accessing the image record. Coherent data streams may be blended at the input array of a subsystem, with the result that a process can transform data derived from multiple input sources in one step. This balances the output limitation. The nine subsystems are further distinguished depending on their functionality as: Sensory subsystems: VIS visual: hue, contour, etc.</Paragraph>
<Paragraph position="6"> AC acoustic: pitch, rhythm, etc. BS body state: proprioception, taste, touch, etc.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="7" type="sub_section"> <SectionTitle> Structural subsystems </SectionTitle>
<Paragraph position="0"> OBJ object: mental imagery, etc.</Paragraph>
<Paragraph position="1"> MPL morphonolexical: lexical forms, etc. Meaning subsystems: PROP propositional: semantic relations, etc. IMPLIC implicational: schematic models, etc. Effector subsystems: ART articulatory: subvocal rehearsal, etc. LIM limb: motion of limbs, eyes, etc. The nine subsystems act effectively as communicating processes running in parallel, as shown in Figure 2.</Paragraph>
<Paragraph position="2"> The overall behaviour of the cognitive system is governed by a number of principles, most of which are outside the scope of this paper. Here, we will address only those configurations that are relevant to interacting with the system described in the previous section. Configurations are the way in which ICS resources are deployed at a point in time to perform a cognitive task. Complex configurations can be constructed from elementary, partial ones, and if an information flow can be constructed, then it is a legal configuration, subject to three constraints. The first is that no process can appear more than once in a configuration. The second is that the order of cyclical flows within the configuration is not important. Finally, although any one of the sensors or effectors may be missing, if all sensors, all effectors, or both are missing from a configuration there must be a central flow. In other words, input alone is meaningless and no output can be generated without either input or central activity.</Paragraph>
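The three constraints can also be read operationally; the sketch below is our own rendering (with the subsystem grouping taken from the list above), not the paper's formal model.

```python
# A sketch of the three constraints on legal configurations. Transformations
# are written as (source, destination) pairs such as ('vis', 'obj'); the
# grouping of subsystems into sensory, effector and central ones follows the
# list above and is an assumption of this sketch.

SENSORS   = {'vis', 'ac', 'bs'}
EFFECTORS = {'art', 'lim'}
CENTRAL   = {'obj', 'mpl', 'prop', 'implic'}

def is_legal(config):
    """config is a collection of (src, dst) transformation processes."""
    flows = list(config)
    # 1. No process may appear more than once in a configuration.
    if len(flows) != len(set(flows)):
        return False
    # 2. The order of cyclical flows is irrelevant: only the set matters.
    flows = set(flows)
    subsystems = {s for flow in flows for s in flow}
    # 3. If all sensors, all effectors, or both are missing, there must be a
    #    central flow (a transformation between two central subsystems).
    has_sensor = bool(subsystems & SENSORS)
    has_effector = bool(subsystems & EFFECTORS)
    central_flow = any(src in CENTRAL and dst in CENTRAL for src, dst in flows)
    if (not has_sensor or not has_effector) and not central_flow:
        return False
    return True

print(is_legal({('vis', 'obj'), ('obj', 'prop'), ('prop', 'obj')}))  # True
print(is_legal({('obj', 'lim')}))   # False: no sensory input and no central flow
```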
<Paragraph position="3"> The key observation underlying syndetic modelling is that the structures and principles embodied within ICS can be formulated as an axiomatic model in the same way as any other information processing system. This means that the cognitive resources of a user can be expressed in the same framework as the behaviour of a computer-based interface, allowing the models to be integrated directly. To begin this process, we define some sets to represent those concepts of ICS that will be used here. Here and elsewhere in this document we will make use of the Z notation (Spivey, 1982) to define data types; much of this is based on common mathematical conventions for sets and relations, for example '×' for Cartesian product and 'ℙ' for power set.</Paragraph>
<Paragraph position="5"> Representations consist of basic units of information organised into superordinate structures. Coherence of units depends on several issues, including the timing of data streams, that will not be addressed here. Instead, coherence is captured abstractly in the form of an equivalence relation over representations: _ ≈ _ : repr ↔ repr. In describing ICS it is also useful to discuss the representations that are being delivered as part of a particular data stream. We therefore introduce a further set, code, whose elements are representations that have been labelled by the subsystem in which they were generated. Representations from or to the outside world are tagged with '*': code == repr × sys. In general we will write R_sys for the code (R, sys), and ':src-dst:' for the transformation (src, dst). The state of the ICS interactor captures the data streams involved in processing activities and the properties of the streams, such as stability and coherence, which define the quality of processing or, in other words, user competence at particular tasks. The sources of data for each transformation are represented by a function 'sources' that takes each transformation 't' to the set of transformations from which 't' is taking input. In general only a subset of transformations are producing stable output, and this set is defined by the attribute 'stable'. The codes that are available for processing at a subsystem are identified by a relation _@_, where 'c@s' means that code 'c' is available at subsystem 's'.</Paragraph>
<Paragraph position="7"> As not all representations are coherent, only certain subsets of the data streams arriving at a subsystem can be employed by a process to generate stable output. The set 'coherent' contains those groups of transformations whose output in the current state can be blended. If the inputs to a process are coherent but unstable, the process can still generate a stable output by buffering the input flow via the image record and thereby operating on an extended representation. However, only one process in the configuration can be buffered at any time¹, and this process is identified by the attribute 'buffered'. The configuration itself is defined to be those processes whose output is stable and which are contributing to the current processing activity.</Paragraph>
<Paragraph position="9"> Four actions are addressed in this model. The first two, 'engage' and 'disengage', allow a process to modify the set of streams from which it is taking information, by adding or removing a stream. A process can enter buffered mode via the 'buffer' action. Lastly, the actual processing of information is represented by 'trans', which allows representations at one subsystem to be transferred by processing activity to another subsystem.</Paragraph>
<Paragraph position="10"> ¹This is actually a simplification for the purposes of the paper.</Paragraph>
<Paragraph position="12"> The principles of information processing embodied by ICS are expressed as axioms over the model defined above. Axiom 1 concerns coherence, and states that a group of processes are coherent if and only if they have the same kind of output (in the code of the system 'dest') and the representations produced by the processes, and therefore available at 'dest', are themselves coherent.</Paragraph>
<Paragraph position="13"> trs ∈ coherent ⇔ ∃ dest : sys • (∀ :t-dest: ∈ trs • R_t@dest) ∧ (∀ :s-dest:, :t-dest: ∈ trs • R_s ≈ R_t). The second axiom is that a transformation is stable if and only if its sources are coherent, and either it is buffered or the sources are themselves stable. A configuration then consists of those processes that are generating stable output that is used elsewhere in the overall processing cycle.</Paragraph>
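To make the first two axioms concrete, a possible operational reading is sketched below; the concrete encoding of the state attributes is assumed, and only the stability check of Axiom 2 is shown.

```python
# A hedged sketch of Axiom 2: a transformation t is stable iff sources(t) forms
# a coherent group and either t is the buffered process or every source of t is
# itself stable. The encoding of the attributes is our assumption.

def is_stable(t, sources, stable, buffered, coherent):
    srcs = frozenset(sources.get(t, ()))
    return srcs in coherent and (t == buffered or srcs <= stable)

# Example state: :vis-obj: takes input from the world ('*') and :obj-prop: from
# :vis-obj:; both source groups are coherent and the world input is stable.
sources  = {':vis-obj:': {'*'}, ':obj-prop:': {':vis-obj:'}}
stable   = {'*', ':vis-obj:'}
coherent = {frozenset({'*'}), frozenset({':vis-obj:'})}
print(is_stable(':obj-prop:', sources, stable, buffered=None, coherent=coherent))  # True
print(is_stable(':vis-obj:', sources, stable, buffered=None, coherent=coherent))   # True
```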
<Paragraph position="15"> A process will not engage an unstable stream (axiom 4). If its own output is unstable, it will either engage a stable stream, disengage an unstable stream, or try to enter buffered mode (axiom 5). The remaining axioms (6-8) define the effects of the three actions.</Paragraph>
<Paragraph position="17"> [disengage(t, s)] sources(t) = S - {s}. The remaining two axioms define the effect of information transfer. Axiom 9 is the 'forward' rule: if a representation is available at a subsystem, then after trans a suitable representation will be available at any other subsystem for which the corresponding process is stable. Conversely, if after trans some information were to become available at a subsystem (dest), then there must exist some source subsystem such that the information is available at the source, and the corresponding transformation is stable.</Paragraph> </Section>
<Section position="3" start_page="7" end_page="7" type="sub_section"> <SectionTitle> 3.2 The structure of mental representations </SectionTitle>
<Paragraph position="0"> Most of the formal account of ICS given in the previous section relies on an understanding of representations and of their structure.</Paragraph>
<Paragraph position="1"> In (May et al., 1997) the process of perception is described as one of structuring the sensory information that we receive from objects in the environment so that we can interact with them. The details about the structure of objects and their inter-relations are not explicitly contained in the sensory information.</Paragraph>
<Paragraph position="2"> They must be interpreted by combining this information with knowledge about the world, which we have learnt through our experience of interacting with it.</Paragraph>
<Paragraph position="3"> Computer displays are like the rest of the world in this respect. Consequently, designing a computer display is all about choosing the form of objects and arranging them so that they are perceived and dealt with by the user of the computer. Different arrangements of the same set of forms may lead to different structurings of the objects' representations. This may result in different user performance on a particular task.</Paragraph>
<Paragraph position="4"> When we look at a visual scene, the features, colours and textures in the sensory information group together to form objects. If we look closely at an object, we can see that it also has a structure and may be composed of other objects. We can see the world at different scales, from a global level down to many levels of detail. For example, figure 3 can be seen as a computer display with objects in it. Focusing attention on a particular object, we may see a window, a cursor, and so on. This hierarchy can be represented as a structure diagram, as in figure 4, where the horizontal groupings are sets of objects at different levels of the visual structure.</Paragraph>
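To make the hierarchy concrete, the structure of figures 3 and 4 can be written down as a simple tree; the rendering below and all object names in it (including the city names) are placeholders of ours, not the actual content of the figures.

```python
# Hypothetical rendering of the structure diagram of figure 4: each key is an
# object on the display, its value lists the objects it decomposes into, and
# each depth of the tree corresponds to one horizontal grouping of the diagram.

structure = {
    'display':   ['window', 'cursor'],
    'window':    ['scrollbar', 'list'],
    'scrollbar': [],
    'cursor':    [],
    'list':      ['Amsterdam', 'Cambridge', 'Pisa'],   # placeholder city names
}

def level(structure, depth, root='display'):
    """Objects seen when the scene is analysed down to a given depth."""
    nodes = [root]
    for _ in range(depth):
        nodes = [child for n in nodes for child in structure.get(n, [])]
    return nodes

print(level(structure, 1))   # ['window', 'cursor']
print(level(structure, 2))   # ['scrollbar', 'list']
print(level(structure, 3))   # the individual city names in the list
```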
<Paragraph position="5"> What we perceive at a given moment is limited by the level at which we analyse the scene. For example, a test made with figure 3 on a number of our colleagues revealed that all of them see a 'list of cities that can be scrolled'. Clearly, this information is the result of an interpretation of the raw sensory data obtained from the eyes, enriched by a set of mental processes that convert the visual representation into an object one to which semantic information is then added.</Paragraph>
<Paragraph position="6"> What is important to notice is that the attention has been focused on the 'list' node in the structure, that to reach that node one might have had to search through the structure, and that 'list' is related to 'scrollbar'.</Paragraph>
<Paragraph position="7"> According to (May et al., 1997) we say that 'list' is the psychological subject being attended, 'scrollbar' (i.e. objects in the same group as the psychological subject) forms its predicate, and 'cities' (i.e. the sub-structure rooted at the psychological subject) forms its attributes. The attention can easily be moved towards one of the predicates of the subject by swapping the subject-predicate relation. Diverting the attention to a distant object in the structure requires much more cognitive load, since it implies traversing larger parts of the structure.</Paragraph>
<Paragraph position="8"> Clearly, we are describing a 'static' situation where people were explicitly asked to perform only a recognition task. In dynamic (real-case) situations the same sensory information is interpreted to perform different tasks, either in sequence or in parallel. For example, to move the cursor over an item (i.e. a city name) one must establish a relation between the cursor and the item, and this requires a reworking of the structure. This can be described as defining a ghost object to which both the cursor and the item are rooted. The ghost is maintained until the cursor-item relation is needed to perform the required task, and it hides the previous structure for that period of time, as shown in figure 5. During this period the objects in hiding cannot be part of the psychological subject. Designing presentations leading to stable structures over tasks greatly increases the ease of the interaction by reducing the cognitive load necessary for restructuring.</Paragraph>
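The subject-predicate-attribute reading and the ghost regrouping can be summarised by the following sketch; the encoding (and the reuse of the placeholder tree from the previous sketch) is ours and is only meant to fix ideas.

```python
# Sketch (assumed encoding) of psychological subject, predicate and attributes
# over a structure tree, and of the 'ghost' regrouping of figure 5.

structure = {
    'display': ['window', 'cursor'], 'window': ['scrollbar', 'list'],
    'scrollbar': [], 'cursor': [], 'list': ['Amsterdam', 'Cambridge', 'Pisa'],
}

def parent(structure, node):
    return next((p for p, children in structure.items() if node in children), None)

def focus(structure, subject):
    """The predicate is the set of objects grouped with the subject; the
    attributes are the sub-structure rooted at the subject."""
    siblings = structure.get(parent(structure, subject), [])
    predicate = [n for n in siblings if n != subject]
    attributes = structure.get(subject, [])
    return predicate, attributes

print(focus(structure, 'list'))   # (['scrollbar'], ['Amsterdam', 'Cambridge', 'Pisa'])

def with_ghost(structure, item):
    """Relate the cursor to an item: a ghost object groups them together and
    hides the rest of the structure for as long as the relation is needed."""
    return {'ghost': ['cursor', item], 'cursor': [], item: []}

print(with_ghost(structure, 'Pisa'))   # hidden objects cannot be the subject
```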
<Paragraph position="10"> This reasoning leads us to add a further axiom to the ICS theory. Two transformation processes within the same subsystem can act in parallel over the same representation or over two representations such that neither is a sub-structure of the other (they are disjoint). Disjointness is captured abstractly in the form of a relation over code:</Paragraph>
<Paragraph position="12"/> </Section>
<Section position="4" start_page="7" end_page="7" type="sub_section"> <SectionTitle> 3.2.1 Levels of mental representations </SectionTitle>
<Paragraph position="0"> In the previous section we have seen that sensory information is interpreted in order to build structured mental representations. The interpretation requires the participation of several subsystems that are deployed in a configuration. The structure in figure 4 is understood because the sensory information from the eyes forms a visual representation, made of colours and the like, which gives rise to the configuration represented in figure 6.</Paragraph>
<Paragraph position="1"> A mental process (VIS) transforms it (:vis-obj:) into an object representation; this involves the structuring of sensory data into objects, and the grouping together of those objects. This new representation can be interpreted by another mental process (OBJ) and transformed (:obj-prop:) to produce a more abstract representation at the propositional level in which objects are identified and related. At this point a third transformation (:prop-obj:) takes place at the propositional subsystem (PROP) that feeds back information about object structure. After this transformation the object structure that is perceived is a blend of information from propositional and visual sources. For this to take place, a number of conditions must be met according to the formal ICS theory, such as:</Paragraph>
<Paragraph position="3"> The configuration deployed so far doesn't justify that the items in the list are recognized as cities. In order to do this the objects' structure must be made available to the morphonolexical subsystem (MPL) as a structured representation of sound. Consequently, the :obj-mpl: transformation operates in parallel with the :obj-prop: one on the same code and produces a morphonolexical representation that is equivalent to the one sent to the propositional subsystem. The morphonolexical subsystem transforms (:mpl-prop:) the representation into propositional code that is blended with the one produced directly by the object subsystem. At the propositional subsystem the :prop-mpl: transformation is activated in parallel with :prop-obj:; these feed back semantic information to the morphonolexical subsystem and enrich the object structure by blending with the object source. Again this requires that some additional properties are satisfied in the ICS theory, such as: {{:obj-prop:, :mpl-prop:}, {:prop-mpl:, :obj-mpl:}} ∈ coherent. The configuration described so far can be defined as the reading one. In fact, it might be noticed that once the object representation is transformed by :obj-mpl: and made available to the morphonolexical subsystem, it is also ready to be spoken by the articulatory subsystem after an :mpl-art: transformation. This read-aloud configuration is obtained by adding the :mpl-art: and the :art-speech: transformations to the reading configuration so that {:mpl-art:, :art-speech:} ⊆ stable. A similar reasoning can be applied to the object subsystem in the sense that once the object structure is formed, the :obj-lim: transformation can generate the limb code equivalent to the object representation so that (for example) the hand operates on the currently selected psychological subject. The new configuration is obtained by imposing that {:obj-lim:, :lim-hand:} ⊆ stable. Together with the described configuration, two feedback loops exist involving the body-state subsystem, which is a source of sensory information. This information represents sensations that our body detects from tasting, touching and smelling, as well as internal sensations such as the position of our arms and legs and the state of our muscles. In our case the body-state subsystem transforms two disjoint representations: one from an interpretation of the hand position and muscle state, and one from the state of the vocal muscles. The information at this level of representation is important to co-ordinate our physical actions, because it enriches the limb and articulatory representations by blending with those produced by the object and the morphonolexical subsystems. Clearly, the following must hold: The final configuration describing the cognitive view of performing a deictic reference by speech and gestures is shown in figure 7. In the following we will refer to this configuration as deixis-Conf.</Paragraph>
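The successive configurations of this section can be summed up informally as sets of transformations required to be stable; the sketch below is our summary, and the body-state transformation names (:bs-lim:, :bs-art:) are assumptions of ours, since the text introduces the feedback loops without naming them.

```python
# Informal summary (not the paper's formal model) of how the configurations of
# this section extend one another, written as sets of transformations that must
# be stable. The names :bs-lim: and :bs-art: for the body-state feedback are
# our assumptions.

reading = {
    ':vis-obj:', ':obj-prop:', ':prop-obj:',    # structure and interpret the objects
    ':obj-mpl:', ':mpl-prop:', ':prop-mpl:',    # recognise the items as words
}
read_aloud  = reading | {':mpl-art:', ':art-speech:'}           # speak what is read
pointing    = reading | {':obj-lim:', ':lim-hand:'}             # act on the selected subject
deixis_conf = read_aloud | pointing | {':bs-lim:', ':bs-art:'}  # add body-state feedback

print(reading <= read_aloud <= deixis_conf)   # True: each configuration extends the last
```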
<Paragraph position="4"> 5 Description of the system interface From the system perspective, the problem can now be formulated as the specification of a presentation that allows the speech and gesture configuration of ICS to be naturally deployed when making use of deixis.</Paragraph>
<Paragraph position="5"> In principle, the devices we could use to implement an interface supporting deixis range from traditional tablets to data gloves, from cameras to video recorders and players, from speakers to microphones, from flat to head-mounted displays with stereoscopic views, and many others. Here we will compare two systems, built respectively from a display and a mouse, and from a display equipped with a touch screen.</Paragraph>
<Paragraph position="6"> The comparison can easily be extended to devices with similar characteristics with respect to the addressed task, such as a tablet instead of the mouse, and a data glove instead of the touch screen.</Paragraph> </Section>
<Section position="5" start_page="7" end_page="7" type="sub_section"> <SectionTitle> 5.1 Display and mouse based interface </SectionTitle>
<Paragraph position="0"> The most common and widespread graphical device is the 2D mouse, a physical device equipped with two transducers able to measure the distance between a current position and a next point along two axes, and with a number of buttons (usually from one to three). The buttons have little value for the purposes of this paper, and are disregarded. The mouse can be described by a very simple interactor, where the type 'RelPos' represents relative positions, i.e. offsets.</Paragraph>
<Paragraph position="2"> The Mouse interactor describes the state space of the device as a coordinate defining the distance of the current position from the previous one along two coordinate axes (RelPos == delta-x × delta-y). The decoration of the 'operate' action means that the device is sensed by the body-state subsystem when it is used, and the notation [...] is used to refer to the perceivable aspect of an attribute, interactor or action.</Paragraph>
<Paragraph position="3"> While the mouse can be used as a pure input device, it is usually coupled with a cursor that provides feedback of the current position in the display space (DispPos). The cursor is an object of type Obj amongst the others in a display, and its position is related to the mouse by a coordinate transformation. Consequently, we explicitly distinguish the cursor in specifying a display interactor.</Paragraph>
<Paragraph position="5"> Objects are located in the display. The cursor location is computed by transforming the current mouse movement at the next refresh of the screen (action 'render'). An object in the display is related to the cursor when it has the same position. The decoration indicates that the objects in the display are visually perceivable.</Paragraph> </Section>
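Before turning to the touch screen, the Mouse and Display interactors just described can be paraphrased as the rough sketch below; this is our approximation rather than the MAL specification, and the object names are placeholders.

```python
# Rough, hypothetical paraphrase of the Mouse and Display interactors: 'operate'
# records a relative movement (and is sensed through body state), 'render' maps
# it into the cursor position at the next screen refresh, and an object is
# related to the cursor when it has the same display position.

class Mouse:
    def __init__(self):
        self.rel_pos = (0, 0)            # RelPos: offset along the two axes

    def operate(self, dx, dy):           # perceivable through the body-state subsystem
        self.rel_pos = (dx, dy)

class Display:
    def __init__(self, objects):
        self.objects = dict(objects)     # visually perceivable objects: name -> DispPos
        self.cursor = (0, 0)             # DispPos of the cursor

    def render(self, mouse):             # coordinate transformation at screen refresh
        dx, dy = mouse.rel_pos
        self.cursor = (self.cursor[0] + dx, self.cursor[1] + dy)

    def related(self):                   # objects at the same position as the cursor
        return [name for name, pos in self.objects.items() if pos == self.cursor]

mouse = Mouse()
display = Display({'Pisa': (3, 4), 'Cambridge': (7, 1)})   # placeholder objects
mouse.operate(3, 4)
display.render(mouse)
print(display.related())                 # ['Pisa']
```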
<Section position="6" start_page="7" end_page="7" type="sub_section"> <SectionTitle> 5.2 Touch-screen based interface </SectionTitle>
<Paragraph position="0"> If we plan to use a touch-screen display to build our interface, there exists only one device, namely the display. In contrast with the mouse-display pair, the</Paragraph> </Section> </Section>
<Section position="4" start_page="7" end_page="7" type="metho"> <SectionTitle> 6 Building the Syndetic Model </SectionTitle>
<Paragraph position="0"> The syndetic model of device interaction is created by introducing both the user and system models into a new interactor and then defining the axioms that govern the conjoint behaviour of the two agents. A new attribute (goals) is used to 'contextualise' the generic ICS model to the task of making a deictic reference, as a set of pairs Obj × Operation. Here, a more realistic approach might be to describe a class of desired or acceptable displays. However, it would add little to the analysis.</Paragraph>
<Paragraph position="1"> The configuration must be set to deixis-Conf and the (goals) attribute is initialized. For the goal to be achieved we locate the buffer at the propositional subsystem to revive the :prop-obj: transformation. ((item, speak) || (item, locate)) In initializing the goals we use the action prefix ';' notation to indicate sequentiality and '||' to indicate parallel composition.</Paragraph> </Section> </Paper>