<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1053">
  <Title>Deixis and Conjunction in Multimodal Systems</Title>
  <Section position="2" start_page="362" end_page="363" type="metho">
    <SectionTitle>
1 Multimodal Commands
</SectionTitle>
    <Paragraph position="0"> Figure 1 Modular architecture (Johnston 1998a) As an example of a multimodal command, in order to reposition an object, a user might say 'move this here' and make two gestures on the display. The spoken command 'move this here' needs to combine with the two gestures, the first indicating the entity to be moved and the second indicating where it should be moved to. In cases where the spoken string needs to combine with more than one gesture, it is assigned a multimodal subcategorization list indicating the gestures it needs, how they contribute to the meaning, and the constraints on their combination. For e.xample, SLI' assigns 'move this here' the feature structure in Figure 2.</Paragraph>
    <Paragraph position="1"> The mmsubcat: list indicates that this input needs to combine with two gestures. The spoken command is constrained to overlap with or follow within five seconds of the first gesture.</Paragraph>
    <Paragraph position="2"> The second gesture must follow within five seconds of the first. The first provides the entity to move and second the new location. GP assigns incolning gestures feature structure representations specifying their semantic type and any object they select and passes these on to MP. MP uses general combinatory schelnata for nmltimodal subcategorization (Jolmston 1998a, p. 628) to combine the gestures with the speech, saturate the nmltilnodal subcategorization list, and yield an executable command.</Paragraph>
    <Paragraph position="3"> cal.:s tlbcat_COlllnlarKl limc:\[l\]</Paragraph>
    <Paragraph position="5"> Figure 2 Feature structure for 'move this here' Tiffs approach has many advantages. It allows for a great degree of expressivity. Combinations of speech with multiple gestures can be described as can visual parsing of collections of gestures.</Paragraph>
    <Paragraph position="6"> Unlike many previous multilaodal systems, the approach is not speech-driven, any piece of content can come fiom any mode. Another significant advantage is the lnoclularity of spoken hmguage parsing (SLP) and multimodal parsing (MP). More general rules regal'ding multimodal integration are in MP while the specific speech graMlllar used for an application is in SLP, enabling reuse of the multimodal parsing module for different applications. This modularity also enables plug-and-play of different kinds of spoken language parsers with the same multimodal parsing component. SLP can be a traditional chart parser, a robust parser, or a stochastic parser (Gorin et al 1997). The modularity of SLP and MP also facilitates the adoption of a different strategy for string parsing tY=om that used for multimodal parsing. Traditional approaches to string parsing, such as chart parsing (Kay 1980) assume the combining constituents to be discrete and in linear order. This imposes significant constraints on the combination of elelnents, greatly reduces the number of Colabinations that need to be considered, and facilitates prediction in parsing. In contrast, multimodal input is distributed over two or three spatial dimensions, speech, and time.</Paragraph>
    <Paragraph position="7"> Unlike words in a string, speech and gesture may overlap temporally and there is no singular dimension on which tim input is linear and discrete. The constraints that drive parsing are  specific to the combining elements and there is not the same general means for predictive parsing (Johnston 1998a).</Paragraph>
    <Paragraph position="8"> While the modularity of spoken language processing and multimodal parsing in Johnston 1998a has many advantages, the assumption that all processing of the spoken string takes place before multimodal integration leads to significant difficulties as the spoken language processing component is expanded to handle more complex language and to provide a compositional analysis of spoken language containing deictics.</Paragraph>
  </Section>
  <Section position="3" start_page="363" end_page="364" type="metho">
    <SectionTitle>
2 Compositional analysis of deictics
</SectionTitle>
    <Paragraph position="0"> The basic problem the approach faces is to provide an analysis of spoken language in multimodal systems which enables the appropriate multilnodal subcategorization frame and associated constraints to be built compositionally in the course of parsing the spoken string. Whatever the syntactic structure of the spoken utterance, the essential constraint on the multimodal subcategorization is that the list of subcategorized gestures match the linear order of the deictic expressions in the utterance, and that the temporal constraints also reflect that order. This can be thought of in terms of lambda abstraction. What we need to do is abstract over all of the unbound variables in the predicate that will be instantiated by gesture. For an expression like 'move tiffs here' we generate the abstraction. 2ge,,tio.2gh,catio,,.nlove(ge,,tio.,glocatio,,). In terms of the analysis above, this amounts to deriving the feature structure in Figure 2 compositionally fi'om feature structures assigned to 'move', 'this', and 'here'.</Paragraph>
    <Paragraph position="1"> One way to accomplish this within the modular approach is to set up the spoken language processing component so that it manipulates two subcat lists: a regular spoken language subcat: list and a multimodal mmsubcat: list. Information about needed gestures percolates through the syntactic parse. The verb 'move' is assigned tim feature structure in Figure 3. It subcategorizes (in the string) for an entity and for a location. If the arguments are not deictic, for example 'move the supplies to the island' the verb simply combines with its arguments to yield a complete command.</Paragraph>
    <Paragraph position="2"> Deictic expressions are assigned structures which subcategorize for phrases which subcategorize for NPs (the deictic expression is essentially type raised). The structure for 'this' is given in Figure 4. Tim structure for 'here' is like that for 'this', except that it selects for a verb subcategorizing for a location rather than an entity (subeat:first:  subeat:first:eontent:type is location).</Paragraph>
    <Paragraph position="3"> -cat : v deictic : no time :114\] I-t: 'pc : move content : I o 9ject : \[1 \]\[tzpe : entity\] Llocation : \[2\]\[type : location c.t,p 1\] \] / :\[ oi.e.t :lU\] . uUca': / \[first : \[cat : np /,'est : / keontent : \[21 L L rest : end \[list :\[31 \] mmsubcat :/end : \[3\] Llasttime : \[4\] Figure 3 Feature structure for 'move' -cat:v dcictic:yes content:\[ 1 \] lime:\[9\] subcat first:  In 'move this here', 'this' combines with the verb to its left, removing the first specification on the subcat: list of 'move' and adding a gesture specification to the resulting mmsubcat:. Then 'here' composes to the left with 'move this' relnoving the next specification on the subeat: and adding another gesture specification to the mmsubcat: I. The constraint on the first gesture i Directionality features in subeat: used to control the relative positions of combining phrases are omitted here to simplify tile exposition.</Paragraph>
    <Paragraph position="4">  differs from that on the others. The t'irst must overlap o1 precede the speech, while tile others lnust follow the preceding gesture. This is achieved with the feature deictie: which is set to yes when composition with the first deictic takes place. The setting of this t'eature determines which of the temporal constraints applies (using conditional constraints). The lasttime: feature always provides the time of the last entity in the sequence o1' inputs. The mmsubcat:end: feature provides access to the end of the current mmsubcat: list. Once the subcat: feature has value end the mmsubcat:end: needs to be set to end and then the value of nunsubcat:list: is the same as lhe msubcat: in Figure 2 and can be passed on to the multimodal parser.</Paragraph>
    <Paragraph position="5"> So then, it is possible to set up tile speech parsing granlular so that it will build tile needed subcategorization for gestures and modularity between specch parsing and multimodal parsing can be maintained. However, as more complex phenomena are considered tile resulting gramlnar becomes more and more complex. In tile example above, the deictic NPs are pronouns ('lhis', 'here'). The grammar of noun phrases needs to be set up so that tile presence of a deictic determiner makes the whole phrase subcategorize for a verb as in 'move this large green one here'. Matters becolne lnore complex as tile grammar is expanded to handle conjunction, for example 'move this and this he,w'. An analysis of nolninal col\junction can be set up in which the multimodal subcategorization lists of conjuncts are combined and assigned constraints such that gestures are required in the order in which the dcictic words (or other phrases requiring gestures) appear. If a deictic appears within a conjoined phrase, that phrase is assigned a representation which subcategorizes for a verb (just as 'this' does above). In 'move this and this there', 'this and this' combines with 'move' then 'there' combines with the result, yielding an expression which subcategorizes for three gestures. The treatment of possessives also needs to be expanded to handle deictics. For example, in 'call this pelwon's mmtber', 'this l)etwon 's number' needs to subcategorize for a verb which subcategorizes fox a nmnber while the multimodal subcategorization is for a gesture on a person. The possibility of larger phrases mapping onto single gestures further complicates matters. For example, to allow lk~r 'move.fi'om here to there' with a line gesture which connects tile start and elld points, SLP will need to assign multimodal subcategorization list with a single line element to the whole phrase 'from here to there', in addition to the other analysis in which this expression multimodally subcategorizes for two gestures. An alternative is to have a rule that breaks down any line into its start and end points. The problem then is that you introduce subpart points into the muitimodal chart that could combine with other speech recognition results and lead to selection of the wrong parse of the multimodal input. Keeping the points together as a line avoids this difficulty but complicates tile SLP grammar. I return to these cases of larger phrases subcategorizing for single gestures in Section 5 below.</Paragraph>
    <Paragraph position="6"> If tile separation of natural language parsing and multimodal integration is to be maintained, the analysis of deictics 1 have shown, or one like it, has to permeate the whole of the natural language grammar so that appropriate nmltimodal subcategorization frames can be built in a general way. This can be done, but as the coverage of the natural language grammar grows, the analysis becomes increasingly baroque and hard to maintain. To overcome these difficulties, I propose here a new architecture in which spoken language parsing and multimodal parsing are interleaved and multilnodal integration takes place at the constituent structure level of simple deictic NPs.</Paragraph>
  </Section>
  <Section position="4" start_page="364" end_page="365" type="metho">
    <SectionTitle>
3 Interleaving spoken language parsing and multimodal parsing
</SectionTitle>
    <Paragraph position="0"> and multimodal parsing There are a nmnber of different ways in which spoken language parsing (SLP) and multimodal parsing (MP) can be imerleaved: (1) SLP populates a chart with fragments, these are passed to MP which determines possible combinations with gesture, the resulting combinations are passed back to SLP which continues until a parse of the string is found, (2) SLP parses the incoming string into a series of fragments, these become edges in MP and are combined with gestures, MP is augmented with rules from SLP which operate in  MP in order to complete the analysis of the phrase, (3) SLP and MP are merged and there is one single  gralnmar covering both spoken language and multimodal parsing (cf. Johnston and Bangalore 2000). 1 adopt here strategy (1) represented in</Paragraph>
    <Section position="1" start_page="365" end_page="365" type="sub_section">
      <Paragraph position="0"> Figure 5 Interleaved architecture A significant advantage of (1) is that it limits the number of elements and combinations that need to considered by the nmltimodal parser. The complexity of the inultidilnensional parsing algorithm is exponential in the worst case (Johnston 1998a) and so it is important to limit the number of elements that need to be considered.</Paragraph>
      <Paragraph position="1"> Another advantage of (1) over (2) and (3) is that as in the modular approach, the grammars are separated, facilitating reuse of the multimodal component for applications with different spoken COlnmands. Also, (2) has the problem that there is redundancy among the SLP and MP grammars, both need to have the grammars of verb subcategorization, conjunction etc.</Paragraph>
      <Paragraph position="2"> Returning now to the example above, 'move this here'. The representation of 'move' is as before in Figure 3, except there is no mmsubcat: feature. The difference lies in the representation of the deictic expressions. In the first pass of SLP, the deictic NP 'this' is assigned the representation in Figure 6 (a). I have used &lt; &gt; to represent the list-wdued mmsubcat: feature and the constraints: feature is given in { }. The location deictic 'here' is assigned a similar representation except that its content:type: feature has value location. All deictic expressions (those with deictic: yes) are passed to MP. MP uses a general subcategorization schema to combine 'this' with an appropriate gesture, yielding the representation in Figure 6 (b). The multimodal subcategorization schema changes the eat: featum from deictic_np to np when the mmsubcat: is saturated. Much the same happens for 'here' and both edges are passed back to SLP and added into the chart (the chart: feature keeps track of their location in the chart). Now that the deictic NPs have been combined with gestures and converted to NPs, spoken language parsing can proceed and 'move' combines with 'this' and 'here' to yield an executable command which is then passed on to MP, which selects the optimal multimodal command and passes it on for execution. In examples with conjunction such as 'move this and this here', the deictic NPs am combined with gestures by MP belbre conjunction takes place in SLP, and so there is no need to complicate the analysis of conjunction.</Paragraph>
      <Paragraph position="3">  cat : dcictic_np deictic : yes time: \[1\[ \[type: entity \] cdegntent : \[selection :\[21J /\[cat: spatial_gesture \]\ /I. * . \[type:area 3/\ ,:a&gt; kso,o ,io,,.</Paragraph>
      <Paragraph position="4"> mmsubcat : tLtime : \[31 J \ / \{overlap(\[l\],\[3\]) v / \follow(\[1 \],\[3\]..5)} / chart : \[1,2\] \[cat : hi' \] \]deictic : no \] /L~/. . \[type : entity \]/ &amp;quot;&amp;quot;/&amp;quot;:deg&amp;quot;tdeg&amp;quot;t \[ o'ootior, : \[o ioc,'dg4 . H |mm,~ubcat : ( ) / \[chart :\[1,2\] \]  In this approach, the level of constituent structure at which multilnodal integration applies is the simple deictic NP. It is preferable to integrate at this level rather than the level of the deictic determiner, since other words in the simple NP will place constraints on the choice and interpretation of the gesture. For example, 'this petwon' is constrained to integrate with a gesture at a person while 'this number' is constrained to integrate with a gesture at a number.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="365" end_page="367" type="metho">
    <SectionTitle>
4 Deictic numerical expressions
</SectionTitle>
    <Paragraph position="0"> I turn now to the analysis of deictic expressions with numerals. An example command fi'om the multimodal messaging application domain is 'email these four people'. This could be handled by developing an analysis that assigns 'these four people' a multimodal subcategorization which selects for four spatial gestures at people: &lt;Gpe,..,.o,,, Gm,..,.o,,. Gp,.,,,.o,,. Gp ........ ,&gt;. Similarly, 'these two organizations' would have tile following multimodal subcategorization: &lt;Go,.~,,,,iz,tio,,, Go,.~,,,,iz,,~o,,&gt;. The multilnodal subcategorization fiame will be saturated in MP through combination with the appropriate number of individual selection gestums. The problem with this approach is that it does not account for the wide range of different gesture patterns that can be  used to refer to a set of N objects on a disphty.</Paragraph>
    <Paragraph position="1"> Single objects may be selected using pointing gestures or circling (or underlining). Circling gestures can also be used to refer to sets of objects and combinations of circling and pointing can be used to enumerate a set of entities. Figure 7 shows some of the different ways that a set of four objects can be refened to using electronic ink.</Paragraph>
    <Paragraph position="2"> The graphical layout of objects on the screen plays an ilnportant role in deterlnining the kind of gesture combinations that are likely. If three objects are close together and another further away, the least effortl'ul gesture combination is to circle the three and then circle oi point at the remaining one. If all four are close together, then it is easiest to make a single area gesture containing all four. If other objects intervene between the objects to be selected, individual selections are lnore likely since there is less risk of accidentally selecting the intervening objects. It is desirable that multimodal systems be able to handle the broad range of ways to select collections of entities so that users can utilize the and most natural gesture least effortful combination.</Paragraph>
    <Paragraph position="3">  The range of possible,gesture combinations can be captured using multimodal subcategorization as above, but this vastly complicates the SLP grammar and leads to an explosion of ambiguity.</Paragraph>
    <Paragraph position="4"> Every time a numerical expression appears a multitude of alternative multimodal subcategorization fralnes would need to be assigned to it.</Paragraph>
    <Paragraph position="5"> To address this problem, my approach is to underspecify the particular configuration of gestures in the multilnodal subcategorization o1' the deictic uumeral expression. Instead of subcategorizing for a sequence of N gestures, 'these N' subcategorizes for a collection of plurality N : &lt;G\[number:N\]&gt;. The expression 'these fi~ttr people' has subcategorization &lt;Gw.~.o,,\[mm,ber:4\]&gt;. An independent set of roles for gesture combination are used to enumerate all of the different ways to refer to a collection of entities. In simplil'ied form, the basic gesture combination rule is as in Figure 8.</Paragraph>
    <Paragraph position="6">  The rule is also constrained so that the combining gestures are adjacent in time and do not intersect with each other. The gesture combination rules will enumerate a broad range of possible gesture collections (though not as many combiuations as when they are enumerated in the mullimodal subcategorization frame). The over-application of these rules can be prevented by using predictive information from SLP; that is, if SLP parses 'these .four people' then these rules are applied to the gesture input in order to construct candidate collections of four people.</Paragraph>
    <Paragraph position="7"> 5 Integration at higher levels of constituent structure In the analysis developed above, multimodal inlegration takes place at the level of simple deictic nominal expressions. There are however nmltimodal utterances where a single gesture maps onto a higher level of constituent structure in the spoken language parse. For example, 'move from here to there' could appear with two pointing gestures, but could also very well appear with a line gesture indicating the start and end of the move. In this case, the integration coukt be kept at the level of 'here' and 'there' by introducing a rule which splits line gestures into their component start and end points (Gli,,e ---) Gi,oim Gl,,,i,,t). The problem with this approach is that it introduces points that MP could then attempt to combine with other recognition results leading to an erroneous parse of the utterance. To avoid this problem the SLP grammar can assign two possible analyses to this string. In one, both 'here' aud 'there' are passed to MP for integration with point gestures. In the other, 'fi'om here to there' is parsed in SLP  and passed to MP for integration with a line gesture. There are related examples with conjunction 'move this organization and this department here'. An encircling gesture could be used to identify 'this organization and this department' (especially if the pen is close to each object as the corresponding deictic phrase is uttered). However, if in the general case we allow SLP to generate multiple analyzes of a conjunction, there will be an explosion of possible patterns generated, just as in the case of deictic numeral expressions. To overcome this difficulty, gesture decomposition rules can be used. In order to avoid errorful combinations with other recognition results, the application of these rules in MP needs to be driven by predictive information from SLP; that is, in our example, if single gestures cannot be found to combine with 'this organization' and 'this department', then the gesture decomposition rules are applied to temporally appropriate multiple selection gestures to extract the needed individual selections. A similar approach could be used to handle 'fi'om here to there' with a controlled GI,-,,. --~ @,o~,,t Gpoi, t rule which only applies when required.</Paragraph>
    <Paragraph position="8"> Conclusion I have proposed an approach to nmltimodal language processing in which spoken language parsing and nmltimodal parsing are more tightly coupled than in the modular pipeliued approach taken in Johnston 1998. The spoken language parsing component and nmltilnodal parsing component cooperate in determining the interpretation of nmltimodal utterances. This enables multimodal integration to occur at a level of constituent structure below the verbal utterance level specifically, the simple deictic noun phrase. This greatly simplifies the development of the spoken language parsing grammar as it is no longer necessary construct a single multimodal subcategorization list for the whole utterance.</Paragraph>
    <Paragraph position="9"> Following the modular approach of Johnston ! 998a, the treatment of multimodal subcategorization permeates the whole gramlnar complicating the analysis of verb subcategorization, conjunction, possessives and inany other phenomena. This new approach also enables more detailed inodeling of temporal constraints in multi-gesture multimodal utterances. I have also argued that a deictic numeral expression should multimodally subcategorize for a collection of entities and should be underspecified with respect to the particular combination of gestures used to pick out the collection. Possible combination patterns are enumerated by gesture composition rules.</Paragraph>
    <Paragraph position="10"> Communication between SLP and MP enables predictive application of rules for gesture composition and decomposition which might otherwise over-apply.</Paragraph>
  </Section>
class="xml-element"></Paper>