Finite-state Multimodal Parsing and Understanding 
Michael Johnston 
AT&T Labs - Research 
Shannon Laboratory, 180 Park Ave 
Florham Park, NJ 07932, USA
johnston@research.att.com
Srinivas Bangalore 
AT&T Labs - Research 
Shannon Laboratory, 180 Park Ave 
Florham Park, NJ 07932, USA 
srini@research.att.com
Abstract 
Multimodal interfaces require effective parsing and
understanding of utterances whose content is distributed
across multiple input modes. Johnston 1998 presents an
approach in which strategies for multimodal integration
are stated declaratively using a unification-based grammar
that is used by a multidimensional chart parser to compose
inputs. This approach is highly expressive and supports a
broad class of interfaces, but offers only limited potential
for mutual compensation among the input modes, is subject
to significant concerns in terms of computational
complexity, and complicates selection among alternative
multimodal interpretations of the input. In this paper, we
present an alternative approach in which multimodal parsing
and understanding are achieved using a weighted finite-state
device which takes speech and gesture streams as inputs and
outputs their joint interpretation. This approach is
significantly more efficient, enables tight-coupling of
multimodal understanding with speech recognition, and
provides a general probabilistic framework for multimodal
ambiguity resolution.
1 Introduction 
Multimodal interfaces are systems that allow input 
and/or output to be conveyed over multiple different 
channels such as speech, graphics, and gesture. They 
enable more natural and effective interaction since 
different kinds of content can be conveyed in the 
modes to which they are best suited (Oviatt, 1997). 
Our specific concern here is with multimodal inter- 
faces supporting input by speech, pen, and touch, but 
the approach we describe has far broader applicability.
These interfaces stand to play a critical role in the
ongoing migration of interaction from the desktop to
wireless portable computing devices (PDAs, next-generation
phones) that offer limited screen real estate, and other
keyboard-less platforms such as public information kiosks.
To realize their full potential, multimodal interfaces
need to support not just input from multiple modes, but
synergistic multimodal utterances optimally distributed
over the available modes (Johnston et al., 1997). In order
to achieve this, an effective method for integration of
content from different modes is needed. Johnston (1998b)
shows how techniques from natural language processing
(unification-based grammars and chart parsing) can be
adapted to support parsing and interpretation of utterances
distributed over multiple modes. In that approach, speech
and gesture recognition produce n-best lists of recognition
results which are assigned typed feature structure
representations (Carpenter, 1992) and passed to a
multidimensional chart parser that uses a multimodal
unification-based grammar to combine the representations
assigned to the input elements. Possible multimodal
interpretations are then ranked and the optimal
interpretation is passed on for execution. This approach
overcomes many of the limitations of previous approaches to
multimodal integration such as (Bolt, 1980; Neal and
Shapiro, 1991) (see (Johnston et al., 1997) (p. 282)). It
supports speech with multiple gestures and visual parsing
of unimodal gestures, and its declarative nature
facilitates rapid prototyping and iterative development of
multimodal systems. Also, the unification-based approach
allows for mutual compensation of recognition errors in the
individual modalities (Oviatt, 1999).
However, the unification-based approach does not allow for
tight-coupling of multimodal parsing with speech and
gesture recognition. Compensation effects are dependent on
the correct answer appearing in the n-best list of
interpretations assigned to each mode. Multimodal parsing
cannot directly influence the progress of speech or gesture
recognition. The multidimensional parsing approach is also
subject to significant concerns in terms of computational
complexity. In the worst case, the multidimensional parsing
algorithm (Johnston, 1998b) (p. 626) is exponential with
respect to the number of input elements. Also, this
approach does not provide a natural framework for combining
the probabilities of speech and gesture events in order to
select among multiple competing multimodal interpretations.
Wu et al. (1999) present a statistical approach for
selecting among multiple possible combinations of speech
and gesture. However, it is not clear how the approach
will scale to more complex verbal language and
combinations of speech with multiple gestures.
In this paper, we propose an alternative approach that
addresses these limitations: parsing, understanding, and
integration of speech and gesture are performed by a single
finite-state device. With certain simplifying assumptions,
multidimensional parsing and understanding with multimodal
grammars can be achieved using a weighted finite-state
automaton (FSA) running on three tapes which represent
speech input (words), gesture input (gesture symbols and
reference markers), and their combined interpretation. We
have implemented our approach in the context of a
multimodal messaging application in which users interact
with a company directory using synergistic combinations of
speech and pen input; a multimodal variant of VPQ
(Buntschuh et al., 1998). For example, the user might say
email this person and this person and gesture with the pen
on pictures of two people on a user interface display. In
addition to the user interface client, the architecture
contains speech and gesture recognition components which
process incoming streams of speech and electronic ink, and
a multimodal language processing component (Figure 1).
[Figure 1 diagram: UI, ASR, Gesture Recognizer, Multimodal Parser/Understander, Backend]
Figure 1: Multimodal architecture
Section 2 provides background on finite-state language
processing. In Section 3, we define and exemplify
multimodal context-free grammars (MCFGs) and their
approximation as multimodal FSAs. We describe our approach
to finite-state representation of meaning and explain how
the three-tape finite-state automaton can be factored out
into a number of finite-state transducers. In Section 4, we
explain how these transducers can be used to enable
tight-coupling of multimodal language processing with
speech and gesture recognition.
2 Finite-state Language Processing 
Finite-state transducers (FST) are finite-state au- 
tomata (FSA) where each transition consists of an 
input and an output symbol. The transition is tra- 
versed if its input symbol matches the current sym- 
bol in the input and generates the output symbol as- 
sociated with the transition. In other words, an FST 
can be regarded as a 2-tape FSA with an input tape 
from which the input symbols are read and an output 
tape where the output symbols are written. 
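The two-tape view can be illustrated with a minimal sketch. The state numbers, the toy word-to-tag arcs, and the function name transduce are assumptions made for illustration only; they are not part of the system described in this paper.

```python
from typing import Dict, List, Optional, Set, Tuple

# transitions[(state, input_symbol)] -> (next_state, output_symbol)
Transitions = Dict[Tuple[int, str], Tuple[int, str]]


def transduce(transitions: Transitions, start: int, finals: Set[int],
              inputs: List[str]) -> Optional[List[str]]:
    """Read symbols from the input tape and write symbols to the output tape.

    Returns the output tape, or None when the input is not accepted.
    """
    state = start
    output: List[str] = []
    for symbol in inputs:
        key = (state, symbol)
        if key not in transitions:
            return None  # no arc whose input symbol matches the current symbol
        state, out = transitions[key]
        output.append(out)  # traversing the arc generates its output symbol
    return output if state in finals else None


# Hypothetical deterministic transducer mapping words to category tags.
TOY: Transitions = {
    (0, "email"): (1, "V"),
    (0, "page"): (1, "V"),
    (1, "this"): (2, "DET"),
    (1, "that"): (2, "DET"),
    (2, "person"): (3, "N"),
}

tags = transduce(TOY, 0, {3}, ["email", "this", "person"])
```

This sketch is deterministic and unweighted; the transducers used in the rest of the paper are weighted and may be nondeterministic.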
Finite-state machines have been extensively applied to
many aspects of language processing including speech
recognition (Pereira and Riley, 1997; Riccardi et al.,
1996), phonology (Kaplan and Kay, 1994), morphology
(Koskenniemi, 1984), chunking (Abney, 1991; Joshi and
Hopely, 1997; Bangalore, 1997), parsing (Roche, 1999), and
machine translation (Bangalore and Riccardi, 2000).
Finite-state models are attractive mechanisms for language
processing since they are (a) efficiently learnable from
data, (b) generally effective for decoding, and (c)
associated with a calculus for composing machines which
allows for straightforward integration of constraints from
various levels of language processing. Furthermore,
software implementing the finite-state calculus is
available for research purposes (Mohri et al., 1998).
Another motivation for our choice of finite-state models is
that they enable tight integration of language processing
with speech and gesture recognition.
3 Finite-state Multimodal Grammars
Multimodal integration involves merging semantic content
from multiple streams to build a joint interpretation for a
multimodal utterance. We use a finite-state device to parse
multiple input streams and to combine their content into a
single semantic representation. For an interface with n
modes, a finite-state device operating over n + 1 tapes is
needed. The first n tapes represent the input streams and
tape n + 1 is an output stream representing their
composition. In the case of speech and pen input there are
three tapes, one for speech, one for pen gesture, and a
third for their combined meaning.
As an example, in the messaging application described
above, users issue spoken commands such as email this
person and that organization and gesture on the appropriate
person and organization on the screen. The structure and
interpretation of multimodal commands of this kind can be
captured declaratively in a multimodal context-free
grammar. We present a fragment capable of handling such
commands in Figure 2.
S → V NP ε:ε:])
NP → DET N
NP → DET N CONJ NP
CONJ → and:ε:,
V → email:ε:email([
V → page:ε:page([
DET → this:ε:ε
DET → that:ε:ε
N → person:Gp:person( ENTRY
N → organization:Go:org( ENTRY
N → department:Gd:dept( ENTRY
ENTRY → ε:e1:e1 ε:ε:)
ENTRY → ε:e2:e2 ε:ε:)
ENTRY → ε:e3:e3 ε:ε:)
ENTRY → ...
Figure 2: Multimodal grammar fragment 
The non-terminals in the multimodal grammar are atomic
symbols. The multimodal aspects of the grammar become
apparent in the terminals. Each terminal contains three
components W:G:M corresponding to the n + 1 tapes, where W
is for the spoken language stream, G is the gesture stream,
and M is the combined meaning. The epsilon symbol is used
to indicate when one of these is empty in a given terminal.
The symbols in W are words from the speech stream. The
symbols in G are of two types. Symbols like Go indicate the
presence of a particular kind of gesture in the gesture
stream, while those like e1 are used as references to
entities referred to by the gesture (see Section 3.1).
Simple deictic pointing gestures are assigned semantic
types based on the entities they are references to. Gp
represents a gestural reference to a person on the display,
Go to an organization, and Gd to a department. Compared
with a feature-based multimodal grammar, these types
constitute a set of atomic categories which make the
relevant distinctions for gesture events predicting speech
events and vice versa. For example, if the gesture is Gp,
then phrases like this person and him are preferred speech
events and vice versa. These categories also play a role in
constraining the semantic representation when the speech is
underspecified with respect to semantic type (e.g. email
this one). These gesture symbols can be organized into a
type hierarchy reflecting the ontology of the entities in
the application domain. For example, there might be a
general type G with subtypes Go and Gp, where Gp has
subtypes Gpm and Gpf for male and female.
A multimodal CFG (MCFG) can be defined formally as a
quadruple <N, T, P, S>. N is the set of nonterminals. P is
the set of productions of the form A → α where A ∈ N and
α ∈ (N ∪ T)*. S is the start symbol for the grammar. T is
the set of terminals of the form (W ∪ ε) : (G ∪ ε) : M*
where W is the vocabulary of speech, G is the vocabulary of
gesture = GestureSymbols ∪ EventSymbols; GestureSymbols =
{Gp, Go, Gpf, Gpm, ...} and a finite collection of
EventSymbols = {e1, e2, ..., en}. M is the vocabulary to
represent meaning and includes event symbols
(EventSymbols ⊂ M).
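As a hedged illustration of this definition, the fragment in Figure 2 can be written down as a plain data structure. The names Terminal, PRODUCTIONS, terminals and undefined_nonterminals, the spelling "eps" for the epsilon symbol, and the choice of three event symbols are assumptions of this sketch rather than details given in the paper.

```python
from typing import Dict, List, NamedTuple, Set, Union

EPS = "eps"  # assumed spelling for the epsilon symbol


class Terminal(NamedTuple):
    w: str  # symbol on the speech tape (W or epsilon)
    g: str  # symbol on the gesture tape (G or epsilon)
    m: str  # symbol on the meaning tape


Symbol = Union[str, Terminal]  # a plain string is a non-terminal

EVENT_SYMBOLS = ["e1", "e2", "e3"]  # finite set of buffer references

# P: productions of the Figure 2 fragment, keyed by left-hand side.
PRODUCTIONS: Dict[str, List[List[Symbol]]] = {
    "S": [["V", "NP", Terminal(EPS, EPS, "])")]],
    "NP": [["DET", "N"], ["DET", "N", "CONJ", "NP"]],
    "CONJ": [[Terminal("and", EPS, ",")]],
    "V": [[Terminal("email", EPS, "email([")],
          [Terminal("page", EPS, "page([")]],
    "DET": [[Terminal("this", EPS, EPS)], [Terminal("that", EPS, EPS)]],
    "N": [[Terminal("person", "Gp", "person("), "ENTRY"],
          [Terminal("organization", "Go", "org("), "ENTRY"],
          [Terminal("department", "Gd", "dept("), "ENTRY"]],
    "ENTRY": [[Terminal(EPS, e, e), Terminal(EPS, EPS, ")")]
              for e in EVENT_SYMBOLS],
}


def terminals(productions: Dict[str, List[List[Symbol]]]) -> Set[Terminal]:
    """Collect T, the set of W:G:M terminals used in the productions."""
    return {sym for rhss in productions.values() for rhs in rhss
            for sym in rhs if isinstance(sym, Terminal)}


def undefined_nonterminals(productions: Dict[str, List[List[Symbol]]]) -> Set[str]:
    """Non-terminals used on a right-hand side but never defined."""
    return {sym for rhss in productions.values() for rhs in rhss
            for sym in rhs
            if not isinstance(sym, Terminal) and sym not in productions}
```

Here N is the key set of PRODUCTIONS and S is the key "S". The check that every event symbol also appears on the meaning tape mirrors the condition EventSymbols ⊂ M.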
In general a context-free grammar can be approximated by
an FSA (Pereira and Wright 1997, Nederhof 1997). The
transition symbols of the approximated FSA are the
terminals of the context-free grammar and in the case of
multimodal CFG as defined above, these terminals contain
three components, W, G and M. The multimodal CFG fragment
in Figure 2 translates into the FSA in Figure 3, a
three-tape finite-state device capable of composing two
input streams into a single output semantic representation
stream.
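A three-tape device of this kind can be simulated directly. The sketch below is an assumption-laden illustration: the state numbering, the arc list (hand-derived from the Figure 2 fragment rather than produced by a grammar approximation algorithm), the spelling "eps" for epsilon, and the function name parse are all hypothetical.

```python
from typing import Dict, List, Tuple

EPS = "eps"  # assumed spelling for the epsilon symbol

# ARCS[state] -> list of (word, gesture, meaning, next_state)
ARCS: Dict[int, List[Tuple[str, str, str, int]]] = {
    0: [("email", EPS, "email([", 1), ("page", EPS, "page([", 1)],
    1: [("this", EPS, EPS, 2), ("that", EPS, EPS, 2)],
    2: [("person", "Gp", "person(", 3),
        ("organization", "Go", "org(", 3),
        ("department", "Gd", "dept(", 3)],
    3: [(EPS, "e1", "e1", 4), (EPS, "e2", "e2", 4), (EPS, "e3", "e3", 4)],
    4: [(EPS, EPS, ")", 5)],
    5: [("and", EPS, ",", 1), (EPS, EPS, "])", 6)],
}
FINALS = {6}


def parse(speech: List[str], gesture: List[str]) -> List[str]:
    """Return every meaning string the FSA writes for the two input tapes."""
    results: List[str] = []

    def walk(state: int, i: int, j: int, meaning: List[str]) -> None:
        if state in FINALS and i == len(speech) and j == len(gesture):
            results.append("".join(meaning))
        for w, g, m, nxt in ARCS.get(state, []):
            # An arc is usable only if its non-epsilon symbols match the tapes.
            if w != EPS and (i >= len(speech) or speech[i] != w):
                continue
            if g != EPS and (j >= len(gesture) or gesture[j] != g):
                continue
            walk(nxt,
                 i + (1 if w != EPS else 0),
                 j + (1 if g != EPS else 0),
                 meaning + ([m] if m != EPS else []))

    walk(0, 0, 0, [])
    return results


meanings = parse("email this person and that organization".split(),
                 "Gp e1 Go e2".split())
```

The search terminates because no cycle in this arc set is made up entirely of arcs that consume neither a word nor a gesture symbol.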
Our approach makes certain simplifying assumptions with
respect to temporal constraints. In multi-gesture
utterances the primary function of temporal constraints is
to force an order on the gestures. If you say move this
here and make two gestures, the first corresponds to this
and the second to here. Our multimodal grammars encode
order but do not impose explicit temporal constraints.
However, general temporal constraints between speech and
the first gesture can be enforced before the FSA is
applied.
3.1 Finite-state Meaning Representation 
A novel aspect of our approach is that in addition to
capturing the structure of language with a finite-state
device, we also capture meaning. This is very important in
multimodal language processing where the central goal is to
capture how the multiple modes contribute to the combined
interpretation. Our basic approach is to write symbols onto
the third tape, which when concatenated together yield the
semantic representation for the multimodal utterance. It
suits our purposes here to use a simple logical
representation with predicates pred(...) and lists
[a, b, ...]. Many other kinds of semantic representation
could be generated. In the fragment in Figure 2, the word
email contributes email([ to the semantics tape, and the
list and predicate are closed when the rule
S → V NP ε:ε:]) applies. The word person writes person( on
the semantics tape.
A significant problem we face in adding meaning into the
finite-state framework is how to represent all of the
different possible specific values that can be contributed
by a gesture. For deictic references a unique identifier is
needed for each object in the interface that the user can
gesture on. For example, if the interface shows lists of
people, there needs to be a unique identifier for each
person. As part of the composition process this identifier
needs
[Figure 3 diagram: three-tape FSA with arcs labelled W:G:M, e.g. department:Gd:dept(, eps:e1:e1, and:eps:,]
Figure 3: Multimodal three-tape FSA
to be copied from the gesture stream into the semantic
representation. In the unification-based approach to
multimodal integration, this is achieved by feature sharing
(Johnston, 1998b). In the finite-state approach, we would
need to incorporate all of the different possible IDs into
the FSA. For a person with id objid345 you need an arc
ε:objid345:objid345 to transfer that piece of information
from the gesture tape to the meaning tape. All of the arcs
for different IDs would have to be repeated everywhere in
the network where this transfer of information is needed.
Furthermore, these arcs would have to be updated as the
underlying database was changed or updated. Matters are
even worse for more complex pen-based data such as drawing
lines and areas in an interactive map application (Cohen et
al., 1998). In this case, the coordinate set from the
gesture needs to be incorporated into the semantic
representation. It might not be practical to incorporate
the vast number of different possible coordinate sequences
into an FSA.
Our solution to this problem is to store these specific
values associated with incoming gestures in a finite set of
buffers labeled e1, e2, e3, ... and in place of the
specific content write in the name of the appropriate
buffer on the gesture tape. Instead of having the specific
values in the FSA, we have the transitions ε:e1:e1,
ε:e2:e2, ε:e3:e3, ... in each location where content needs
to be transferred from the gesture tape to the meaning tape
(see Figure 3). These are generated from the ENTRY
productions in the multimodal CFG in Figure 2. The gesture
interpretation module empties the buffers and starts back
at e1 after each multimodal command, and so we are limited
to a finite set of gesture events in a single utterance.
Returning to the example email this person and that
organization, assume the user gestures on entities
objid367 and objid893. These will be stored in buffers e1
and e2. Figure 4 shows the speech and gesture streams and
the resulting combined meaning.
S: email this person and that organization
G: Gp e1 Go e2
M: email([ person(e1) , org(e2) ])
Figure 4: Messaging domain example
The elements on the meaning tape are concatenated and the
buffer references are replaced to yield
email([person(objid367), org(objid893)]). As more recursive
semantic phenomena such as possessives and other complex
noun phrases are added to the grammar the resulting
machines become larger. However, the computational
consequences of this can be lessened by lazy evaluation
techniques (Mohri, 1997) and we believe that this
finite-state approach to constructing semantic
representations is viable for a broad range of
sophisticated language interface tasks. We have implemented
a sizeable multimodal CFG for VPQ (see Section 1): 417
rules and a lexicon of 2388 words.
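The buffer mechanism can be sketched as follows. The class GestureBuffers, the function resolve, the buffer count of three, and the use of OverflowError are hypothetical choices made for this illustration; the paper does not specify how the gesture interpretation module is implemented.

```python
from typing import Dict, List


class GestureBuffers:
    """A finite set of buffers e1, e2, ... holding gesture-specific values."""

    def __init__(self, size: int = 3) -> None:
        self.size = size
        self.values: Dict[str, str] = {}

    def store(self, value: str) -> str:
        """Store a value and return the buffer name to write on the gesture tape."""
        if len(self.values) >= self.size:
            raise OverflowError("no free gesture buffer for this utterance")
        name = f"e{len(self.values) + 1}"
        self.values[name] = value
        return name

    def reset(self) -> None:
        """Empty the buffers so the next command starts back at e1."""
        self.values.clear()


def resolve(meaning_tape: List[str], buffers: GestureBuffers) -> str:
    """Concatenate the meaning tape, replacing buffer references by their values."""
    return "".join(buffers.values.get(sym, sym) for sym in meaning_tape)


buffers = GestureBuffers()
first = buffers.store("objid367")   # gesture on a person
second = buffers.store("objid893")  # gesture on an organization
tape = ["email([", "person(", first, ")", ",", "org(", second, ")", "])"]
meaning = resolve(tape, buffers)
```

Because substitution happens after the finite-state stage, the automaton itself never needs arcs for individual object identifiers or coordinate sequences.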
3.2 Multimodal Finite-state Transducers 
While a three-tape finite-state automaton is feasible in
principle (Rosenberg, 1964), currently available tools for
finite-state language processing (Mohri et al., 1998) only
support finite-state transducers (FSTs) (two tapes).
Furthermore, speech recognizers typically do not support
the use of a three-tape FSA as a language model. In order
to implement our approach, we convert the three-tape FSA
(Figure 3) into an FST, by decomposing the transition
symbols into an input component (G × W) and output
component M, thus resulting in a function,
T: (G × W) → M. This corresponds to a transducer in which
gesture symbols and words are on the input tape and the
meaning is on the output tape (Figure 6). The domain of
this function T can be further curried to result in a
transducer that maps R: G → W (Figure 7). This transducer
captures the constraints that gesture places on the speech
stream and we use it as a language model for constraining
the speech recognizer based on the recognized gesture
string. In the following section, we explain how T and R
are used in conjunction with the speech recognition engine
and gesture recognizer and interpreter to parse and
interpret multimodal input.
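The factoring of three-tape arcs into T and R can be sketched in a few lines. The arc list, the state numbers, the "gesture_word" naming of combined input symbols, and the function names are assumptions of this illustration and are not taken from the finite-state toolkit cited above.

```python
from typing import List, Tuple

EPS = "eps"  # assumed spelling for the epsilon symbol

# Three-tape arcs: (source, word, gesture, meaning, destination)
ThreeTapeArc = Tuple[int, str, str, str, int]
# Two-tape arcs: (source, input, output, destination)
FstArc = Tuple[int, str, str, int]

ARCS: List[ThreeTapeArc] = [
    (0, "email", EPS, "email([", 1),
    (1, "this", EPS, EPS, 2),
    (2, "person", "Gp", "person(", 3),
    (3, EPS, "e1", "e1", 4),
    (4, EPS, EPS, ")", 5),
    (5, "and", EPS, ",", 1),
    (5, EPS, EPS, "])", 6),
]


def meaning_transducer(arcs: List[ThreeTapeArc]) -> List[FstArc]:
    """T: (G x W) -> M. Gesture and word are fused into one input symbol."""
    return [(s, f"{g}_{w}", m, d) for s, w, g, m, d in arcs]


def gesture_speech_transducer(arcs: List[ThreeTapeArc]) -> List[FstArc]:
    """R: G -> W. Gesture on the input tape, word on the output tape."""
    return [(s, g, w, d) for s, w, g, m, d in arcs]


T = meaning_transducer(ARCS)
R = gesture_speech_transducer(ARCS)
```

Both machines keep the state structure of the three-tape automaton, so a path through R can later be relabelled with combined symbols and mapped through T.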
4 Applying Multimodal Transducers 
There are a number of different ways in which multimodal
finite-state transducers can be integrated with speech and
gesture recognition. The best approach to take depends on
the properties of the particular interface to be supported.
The approach we outline here involves recognizing gesture
first then using the observed gestures to modify the
language model for speech recognition. This is a good
choice if there is limited ambiguity in gesture
recognition, for example, if the majority of gestures are
unambiguous deictic pointing gestures.
The first step is for the gesture recognition and
interpretation module to process incoming pen gestures and
construct a finite-state machine Gesture corresponding to
the range of gesture interpretations. In our example case
(Figure 4) the gesture input is unambiguous and the Gesture
finite-state machine will be as in Figure 5. If the
gestural input involves gesture recognition or is otherwise
ambiguous it is represented as a lattice indicating all of
the possible recognitions and interpretations of the
gesture stream. This allows speech to compensate for
gesture errors and supports mutual compensation.
Figure 5: Gesture finite-state machine
This Gesture finite-state machine is then composed with
the transducer R, which represents the relationship between
speech and gesture (Figure 7). The result of this
composition is a transducer GestLang (Figure 8). This
transducer represents the relationship between this
particular stream of gestures and all of the possible word
sequences that could co-occur with those gestures. In order
to use this information to guide the speech recognizer, we
then take a projection on the output tape (speech) of
GestLang to yield a finite-state machine which is used as a
language model for speech recognition (Figure 9). Using
this model enables the gestural information to directly
influence the speech recognizer's search. Speech
recognition yields a lattice of possible word sequences. In
our example case it yields the word sequence email this
person and that organization (Figure 10). We now need to
reintegrate the gesture information that we removed in the
projection step before recognition. This is achieved by
composing GestLang (Figure 8) with the result lattice from
speech recognition (Figure 10), yielding transducer
GestSpeechFST (Figure 11). This transducer contains the
information both from the speech stream and from the
gesture stream. The next step is to generate the combined
meaning representation. To achieve this GestSpeechFST
(G : W) is converted into an FSM GestSpeechFSM by combining
output and input on one tape (G × W) (Figure 12).
GestSpeechFSM is then composed with T (Figure 6), which
relates speech and gesture to meaning, yielding the result
transducer Result (Figure 13). The meaning is read from the
output tape yielding email([person(e1), org(e2)]). We have
implemented this approach and applied it in a multimodal
interface to VPQ on a wireless PDA. In preliminary speech
recognition experiments, our approach yielded an average of
23% relative sentence-level error reduction on a corpus of
1000 utterances (Johnston and Bangalore, 2000).
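The sequence of steps in this section can be walked through with a small unweighted simulation. Everything below is a sketch under stated assumptions: paths are enumerated explicitly instead of building composed machines, the speech recognizer is replaced by a hypothetical stand-in that simply checks an utterance against the projected language model, and the arc list, state numbers and function names are illustrative.

```python
from typing import Callable, List, Optional, Set, Tuple

EPS = "eps"
FINAL = 6
# Three-tape arcs: (source, word, gesture, meaning, destination)
ARCS = [
    (0, "email", EPS, "email([", 1), (0, "page", EPS, "page([", 1),
    (1, "this", EPS, EPS, 2), (1, "that", EPS, EPS, 2),
    (2, "person", "Gp", "person(", 3),
    (2, "organization", "Go", "org(", 3),
    (2, "department", "Gd", "dept(", 3),
    (3, EPS, "e1", "e1", 4), (3, EPS, "e2", "e2", 4), (3, EPS, "e3", "e3", 4),
    (4, EPS, EPS, ")", 5),
    (5, "and", EPS, ",", 1),
    (5, EPS, EPS, "])", 6),
]
R = [(s, g, w, d) for s, w, g, m, d in ARCS]          # R: G -> W
T = [(s, f"{g}_{w}", m, d) for s, w, g, m, d in ARCS]  # T: (G x W) -> M

Path = List[Tuple[str, str]]
Words = Tuple[str, ...]


def accepting_paths(arcs, inputs: List[str]) -> List[Path]:
    """All accepting (input, output) label paths of a transducer over inputs."""
    found: List[Path] = []

    def walk(state: int, pos: int, path: Path) -> None:
        if state == FINAL and pos == len(inputs):
            found.append(path)
        for src, inp, out, dst in arcs:
            if src != state:
                continue
            if inp == EPS:  # epsilon input: consume nothing
                walk(dst, pos, path + [(inp, out)])
            elif pos < len(inputs) and inputs[pos] == inp:
                walk(dst, pos + 1, path + [(inp, out)])

    walk(0, 0, [])
    return found


def words(path: Path) -> Words:
    """Projection of a (gesture, word) path onto its non-epsilon words."""
    return tuple(w for _, w in path if w != EPS)


def fake_recognizer(utterance: str) -> Callable[[Set[Words]], Optional[Words]]:
    """Hypothetical stand-in for ASR constrained by the language model."""
    target = tuple(utterance.split())
    return lambda language_model: target if target in language_model else None


def understand(gestures: List[str], recognize):
    gest_lang = accepting_paths(R, gestures)         # Gesture composed with R
    language_model = {words(p) for p in gest_lang}   # projection onto speech
    spoken = recognize(language_model)               # speech recognition
    gest_speech_fst = [p for p in gest_lang if words(p) == spoken]
    meanings = []
    for path in gest_speech_fst:
        gest_speech_fsm = [f"{g}_{w}" for g, w in path]  # one tape, G x W
        for result in accepting_paths(T, gest_speech_fsm):
            meanings.append("".join(m for _, m in result if m != EPS))
    return language_model, meanings


language_model, meanings = understand(
    ["Gp", "e1", "Go", "e2"],
    fake_recognizer("email this person and that organization"))
```

In this toy setting the gesture sequence narrows the language model to the word sequences compatible with a person gesture followed by an organization gesture, which is the effect the projection step is meant to have on the recognizer's search.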
5 Conclusion 
We have presented here a novel approach to multimodal
language processing in which spoken language and gesture
are parsed and integrated by a single weighted finite-state
device. This device provides language models for speech and
gesture recognition and composes content from speech and
gesture into a single semantic representation. Our approach
is novel not just in addressing multimodal language but
also in the encoding of semantics as well as syntax in a
finite-state device.
Compared to previous approaches (Johnston et al., 1997;
Johnston, 1998a; Wu et al., 1999) which compose elements
from n-best lists of recognition results, our approach
provides an unprecedented potential for mutual compensation
among the input modes. It enables gestural input to
dynamically alter the language model used for speech
recognition. Furthermore, our approach avoids the
computational complexity of multidimensional multimodal
parsing and our system of weighted finite-state transducers
provides a well understood probabilistic framework for
combining the probability distributions associated with
speech and gesture input and selecting among multiple
competing multimodal interpretations. Since the
finite-state approach is more lightweight in computational
needs, it can more readily be deployed on a broader range
of platforms.
In ongoing research, we are collecting a corpus of
multimodal data in order to formally evaluate the
effectiveness of our approach and to train weights for the
multimodal finite-state transducers. While we have
concentrated here on understanding, in principle the same
device could be applied to multimodal
Figure 6: Transducer relating gesture and speech to meaning (T: (G × W) → M)
Figure 7: Transducer relating gesture and speech (R: G → W)
Figure 8: GestLang Transducer
Figure 9: Projection of Output tape of GestLang Transducer
Figure 10: Result from speech recognizer
Figure 11: GestureSpeechFST
Figure 12: GestureSpeechFSM
Figure 13: Result Transducer
generation which we are currently investigating. We are
also exploring techniques to extend compilation from
feature structure grammars to FSTs (Johnson, 1998) to
multimodal unification-based grammars.

References 

Steven Abney. 1991. Parsing by chunks. In Robert
Berwick, Steven Abney, and Carol Tenny, editors,
Principle-based Parsing. Kluwer Academic Publishers.

Srinivas Bangalore and Giuseppe Riccardi. 2000.
Stochastic finite-state models for spoken language
machine translation. In Proceedings of the Workshop
on Embedded Machine Translation Systems.

Srinivas Bangalore. 1997. Complexity of Lexical
Descriptions and its Relevance to Partial Parsing.
Ph.D. thesis, University of Pennsylvania,
Philadelphia, PA, August.

Robert A. Bolt. 1980. "Put-that-there": voice and
gesture at the graphics interface. Computer
Graphics, 14(3):262-270.

Bruce Buntschuh, C. Kamm, G. DiFabbrizio,
A. Abella, M. Mohri, S. Narayanan, I. Zeljkovic,
R.D. Sharp, J. Wright, S. Marcus, J. Shaffer,
R. Duncan, and J.G. Wilpon. 1998. VPQ: A
spoken language interface to large scale directory
information. In Proceedings of ICSLP, Sydney,
Australia.

Robert Carpenter. 1992. The Logic of Typed Feature
Structures. Cambridge University Press, England.

Philip R. Cohen, M. Johnston, D. McGee, S. L.
Oviatt, J. Pittman, I. Smith, L. Chen, and
J. Clow. 1998. Multimodal interaction for
distributed interactive simulation. In M. Maybury
and W. Wahlster, editors, Readings in Intelligent
Interfaces. Morgan Kaufmann Publishers.

Mark Johnson. 1998. Finite-state approximation
of constraint-based grammars using left-corner
grammar transforms. In Proceedings of COLING-ACL,
pages 619-623, Montreal, Canada.

Michael Johnston and Srinivas Bangalore. 2000.
Tight-coupling of multimodal language processing
with speech recognition. Technical report,
AT&T Labs - Research.

Michael Johnston, P.R. Cohen, D. McGee, S.L. Oviatt,
J.A. Pittman, and I. Smith. 1997. Unification-based
multimodal integration. In Proceedings of
the 35th ACL, pages 281-288, Madrid, Spain.

Michael Johnston. 1998a. Multimodal language
processing. In Proceedings of ICSLP, Sydney,
Australia.

Michael Johnston. 1998b. Unification-based multi- 
modal parsing. In Proceedings of COLING-ACL, 
pages 624-630, Montreal, Canada. 

Aravind Joshi and Philip Hopely. 1997. A parser
from antiquity. Natural Language Engineering,
2(4).

Ronald M. Kaplan and M. Kay. 1994. Regular models
of phonological rule systems. Computational
Linguistics, 20(3):331-378.

K. K. Koskenniemi. 1984. Two-level morphology: a
general computation model for word-form recognition
and production. Ph.D. thesis, University of
Helsinki.

Mehryar Mohri, Fernando C. N. Pereira, and
Michael Riley. 1998. A rational design for a
weighted finite-state transducer library. Number
1436 in Lecture notes in computer science.
Springer, Berlin ; New York.

Mehryar Mohri. 1997. Finite-state transducers in
language and speech processing. Computational
Linguistics, 23(2):269-312.

J. G. Neal and S. C. Shapiro. 1991. Intelligent
multimedia interface technology. In J. W. Sullivan and
S. W. Tyler, editors, Intelligent User Interfaces,
pages 45-68. ACM Press, Addison Wesley, New
York.

Sharon L. Oviatt. 1997. Multimodal interactive
maps: Designing for human performance. In
Human-Computer Interaction, pages 93-129.

Sharon L. Oviatt. 1999. Mutual disambiguation of
recognition errors in a multimodal architecture. In
CHI '99, pages 576-583. ACM Press, New York.

Fernando C.N. Pereira and Michael D. Riley. 1997.
Speech recognition by composition of weighted
finite automata. In E. Roche and Y. Schabes,
editors, Finite State Devices for Natural Language
Processing, pages 431-456. MIT Press, Cambridge,
Massachusetts.

Giuseppe Riccardi, R. Pieraccini, and E. Bocchieri. 
1996. Stochastic Automata for Language Model- 
ing. Computer Speech and Language, 10(4):265- 
293. 

Emmanuel Roche. 1999. Finite state transducers:
parsing free and frozen sentences. In András
Kornai, editor, Extended Finite State Models of
Language. Cambridge University Press.

A.L. Rosenberg. 1964. On n-tape finite state accep- 
ters. FOCS, pages 76-81. 

Lizhong Wu, Sharon L. Oviatt, and Philip R. Cohen.
1999. Multimodal integration - a statistical view.
IEEE Transactions on Multimedia, 1(4):334-341,
December.
