Balancing Data-driven and Rule-based Approaches in the Context of a
Multimodal Conversational System
Srinivas Bangalore
AT&T Labs-Research
180 Park Avenue
Florham Park, NJ 07932
srini@research.att.com
Michael Johnston
AT&T Labs-Research
180 Park Avenue
Florham Park, NJ 07932
johnston@research.att.com
Abstract
Moderate-sized rule-based spoken language
models for recognition and understanding are
easy to develop and provide the ability to
rapidly prototype conversational applications.
However, scalability of such systems is a bottleneck due to the heavy cost of
authoring and maintaining rule sets and the inevitable brittleness caused by
gaps in their coverage.
In contrast, data-driven approaches are robust
and the procedure for model building is usu-
ally simple. However, the lack of data in a par-
ticular application domain limits the ability to
build data-driven models. In this paper, we ad-
dress the issue of combining data-driven and
grammar-based models for rapid prototyping
of robust speech recognition and understanding
models for a multimodal conversational sys-
tem. We also present methods that reuse data
from different domains and investigate the lim-
its of such models in the context of a particular
application domain.
1 Introduction
In the past four decades of speech and natural language
processing, both data-driven approaches and rule-based
approaches have been prominent at different periods in
time. In the recent past, rule-based approaches have
fallen into disfavor due to their brittleness and the sig-
nificant cost of authoring and maintaining complex rule
sets. Data-driven approaches are robust and provide a
simple process of developing applications given the data
from the application domain. However, the reliance on
domain-specific data is also one of the significant bottle-
necks of data-driven approaches. Development of a con-
versational system using data-driven approaches cannot
proceed until data pertaining to the application domain is
available. The collection and annotation of such data is
extremely time-consuming and tedious, a problem aggravated by the presence of
multiple modalities in the user's input, as in our case. Moreover, extending an
existing application to support an additional feature requires collecting
additional data that exhibits that feature.
In this paper, we explore various methods for combin-
ing rule-based and in-domain data for rapid prototyping
of speech recognition and understanding models that are
robust to ill-formed or unexpected input in the context
of a multimodal conversational system. We also investi-
gate approaches to reuse out-of-domain data and compare
their performance against the performance of in-domain
data-driven models.
We investigate these issues in the context of a multi-
modal application designed to provide an interactive city
guide: MATCH. In Section 2, we present the MATCH
application, the architecture of the system and the appa-
ratus for multimodal understanding. In Section 3, we dis-
cuss various approaches to rapid prototyping of the lan-
guage model for the speech recognizer and in Section 4
we present two approaches to robust multimodal under-
standing. Section 5 presents the results for speech recog-
nition and multimodal understanding using the different
approaches we consider.
2 The MATCH application
MATCH (Multimodal Access To City Help) is a work-
ing city guide and navigation system that enables mo-
bile users to access restaurant and subway information
for New York City (NYC) (Johnston et al., 2002b; John-
ston et al., 2002a). The user interacts with a graphical in-
terface displaying restaurant listings and a dynamic map
showing locations and street information. The inputs can
be speech, drawing on the display with a stylus, or syn-
chronous multimodal combinations of the two modes.
The user can ask for the review, cuisine, phone number,
address, or other information about restaurants and sub-
way directions to locations. The system responds with
graphical callouts on the display, synchronized with syn-
thetic speech output. For example, if the user says phone
numbers for these two restaurants and circles two restau-
rants as in Figure 1 [a], the system will draw a callout
with the restaurant name and number and say, for exam-
ple Time Cafe can be reached at 212-533-7000, for each
restaurant in turn (Figure 1 [b]). If the immediate en-
vironment is too noisy or public, the same command can
be given completely in pen by circling the restaurants and
writing phone.
Figure 1: Two area gestures
2.1 MATCH Multimodal Architecture
The underlying architecture that supports MATCH con-
sists of a series of re-usable components which commu-
nicate over sockets through a facilitator (MCUBE) (Fig-
ure 2). Users interact with the system through a Multi-
modal User Interface Client (MUI). Their speech and ink
are processed by speech recognition (Sharp et al., 1997)
(ASR) and handwriting/gesture recognition (GESTURE,
HW RECO) components respectively. These recognition
processes result in lattices of potential words and ges-
tures. These are then combined and assigned a mean-
ing representation using a multimodal finite-state device
(MMFST) (Johnston and Bangalore, 2000; Johnston et
al., 2002b). This provides as output a lattice encoding all
of the potential meaning representations assigned to the
user inputs. This lattice is flattened to an N-best list and
passed to a multimodal dialog manager (MDM) (John-
ston et al., 2002b), which re-ranks them in accordance
with the current dialogue state. If additional informa-
tion or confirmation is required, the MDM enters into a
short information gathering dialogue with the user. Once
a command or query is complete, it is passed to the mul-
timodal generation component (MMGEN), which builds
a multimodal score indicating a coordinated sequence of
graphical actions and TTS prompts. This score is passed
back to the Multimodal UI (MUI). The Multimodal UI
coordinates presentation of graphical content with syn-
thetic speech output using the AT&T Natural Voices TTS
engine (Beutnagel et al., 1999). The subway route con-
straint solver (SUBWAY) identifies the best route be-
tween any two points in New York City.
Figure 2: Multimodal Architecture
2.2 Multimodal Integration and Understanding
Our approach to integrating and interpreting multimodal
inputs (Johnston et al., 2002b; Johnston et al., 2002a) is
an extension of the finite-state approach previously pro-
posed (Bangalore and Johnston, 2000; Johnston and Ban-
galore, 2000). In this approach, a declarative multimodal
grammar captures both the structure and the interpreta-
tion of multimodal and unimodal commands. The gram-
mar consists of a set of context-free rules. The multi-
modal aspects of the grammar become apparent in the
terminals, each of which is a triple W:G:M, consisting
of speech (words, W), gesture (gesture symbols, G), and
meaning (meaning symbols, M). The multimodal gram-
mar encodes not just multimodal integration patterns but
also the syntax of speech and gesture, and the assignment
of meaning. The meaning is represented in XML, facil-
itating parsing and logging by other system components.
The symbol SEM is used to abstract over specific content
such as the set of points delimiting an area or the identi-
fiers of selected objects. In Figure 3, we present a small
simplified fragment from the MATCH application capa-
ble of handling information seeking commands such as
phone for these three restaurants. The epsilon symbol (ε)
indicates that a stream is empty in a given terminal.
CMD    → ε:ε:<cmd> INFO ε:ε:</cmd>
INFO   → ε:ε:<type> TYPE ε:ε:</type> for:ε:ε ε:ε:<obj> DEICNP ε:ε:</obj>
TYPE   → phone:ε:phone | review:ε:review
DEICNP → DDETPL ε:area:ε ε:sel:ε NUM HEADPL
DDETPL → these:G:ε | those:G:ε
HEADPL → restaurants:rest:<rest> SEM:SEM:ε ε:ε:</rest>
NUM    → two:2:ε | three:3:ε ... ten:10:ε
Figure 3: Multimodal grammar fragment
[Figure 4 shows three lattices for this example: the speech lattice (phone for
these two restaurants), the gesture lattice (an area gesture with SEM(points...)
or a selection of 2 restaurants with SEM(r12,r15)), and the resulting meaning
lattice (<cmd><info><type>phone</type><obj><rest> r12,r15 </rest></obj></info></cmd>).]
Figure 4: Multimodal Example
In the example above where the user says phone for
these two restaurants while circling two restaurants (Fig-
ure 1 [a]), assume the speech recognizer returns the lat-
tice in Figure 4 (Speech). The gesture recognition com-
ponent also returns a lattice (Figure 4, Gesture) indicat-
ing that the user’s ink is either a selection of two restau-
rants or a geographical area. The multimodal grammar
(Figure 3) expresses the relationship between what the
user said, what they drew with the pen, and their com-
bined meaning, in this case Figure 4 (Meaning). The
meaning is generated by concatenating the meaning sym-
bols and replacing SEM with the appropriate specific con-
tent: <cmd><info><type> phone </type><obj><rest> [r12,r15] </rest></obj></info></cmd>.
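To make the mechanics concrete, the following minimal Python sketch (not the
system's FST implementation; the derivation list is a hand-built illustration of
the Figure 3 fragment) shows how a meaning string is read off the W:G:M
terminals, with SEM replaced by gesture-specific content.

```python
# Minimal sketch: each terminal along the derivation for "phone for these two
# restaurants" plus a circling gesture is a (speech, gesture, meaning) triple;
# "eps" marks an empty stream. The meaning is the concatenation of the meaning
# symbols, with SEM replaced by the content carried by the gesture.
derivation = [
    ("eps", "eps", "<cmd>"), ("eps", "eps", "<info>"),
    ("eps", "eps", "<type>"), ("phone", "eps", "phone"), ("eps", "eps", "</type>"),
    ("for", "eps", "eps"), ("eps", "eps", "<obj>"),
    ("these", "G", "eps"), ("eps", "area", "eps"), ("eps", "sel", "eps"),
    ("two", "2", "eps"),
    ("restaurants", "rest", "<rest>"), ("SEM", "SEM", "SEM"), ("eps", "eps", "</rest>"),
    ("eps", "eps", "</obj>"), ("eps", "eps", "</info>"), ("eps", "eps", "</cmd>"),
]

def build_meaning(derivation, gesture_content):
    """Concatenate meaning symbols, splicing in gesture content for SEM."""
    out = []
    for _, _, meaning in derivation:
        if meaning == "eps":
            continue
        out.append(gesture_content if meaning == "SEM" else meaning)
    return "".join(out)

print(build_meaning(derivation, "[r12,r15]"))
# <cmd><info><type>phone</type><obj><rest>[r12,r15]</rest></obj></info></cmd>
```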
For the purpose of evaluation of concept accuracy, we
developed an approach similar to (Boros et al., 1996)
in which computing concept accuracy is reduced to com-
paring strings representing core contentful concepts. We
extract a sorted flat list of attribute value pairs that repre-
sents the core contentful concepts of each command from
the XML output. For the example above, this yields a representation of the form:

cmd:info type:phone object:[r12,r15]   (1)
The multimodal grammar can be used to create lan-
guage models for ASR, align the speech and gesture re-
sults from the respective recognizers and transform the
multimodal utterance to a meaning representation. All
these operations are achieved using finite-state transducer
operations (See (Bangalore and Johnston, 2000; John-
ston and Bangalore, 2000) for details). However, this ap-
proach to recognition needs to be more robust to extra-grammaticality and
language variation in users' utterances, and the interpretation needs to be more robust to
speech recognition errors. We address these issues in the
rest of the paper.
3 Bootstrapping Corpora for Language
Models
The problem of speech recognition can be succinctly represented as a search for
the most likely word sequence (W) through the network created by the composition
of a language of acoustic observations (O), an acoustic model which is a
transduction from acoustic observations to phone sequences (A), a pronunciation
model which is a transduction from phone sequences to word sequences (P), and a
language model acceptor (L) (Pereira and Riley, 1997). The language model
acceptor encodes the (weighted) word sequences permitted in an application.

W* = argmax_W (O ∘ A ∘ P ∘ L)(W)   (2)

Typically, L is built using either a hand-crafted grammar or using a statistical language model derived from a
corpus of sentences from the application domain. While
a grammar could be written so as to be easily portable
across applications, it suffers from being too prescrip-
tive and has no metric for relative likelihood of users’
utterances. In contrast, in the data-driven approach a
weighted grammar is automatically induced from a cor-
pus and the weights can be interpreted as a measure for
relative likelihood of users’ utterances. However, the re-
liance on a domain-specific corpus is one of the signif-
icant bottlenecks of data-driven approaches, since col-
lecting a corpus specific to a domain is an expensive and
time-consuming task.
In this section, we investigate a range of techniques
for producing a domain-specific corpus using resources
such as a domain-specific grammar as well as an out-of-
domain corpus. We refer to the corpus resulting from
such techniques as a domain-specific derived corpus in
contrast to a domain-specific collected corpus. The idea
is that the derived domain-specific corpus would obvi-
ate the need for in-domain corpus collection. In partic-
ular, we are interested in techniques that would result
in corpora such that the performance of language mod-
els trained on these corpora would rival the performance
of models trained on corpora collected specifically for a
specific domain. We investigate these techniques in the
context of MATCH.
We use the notation C_X for a corpus, Λ_X for the language model built using the
corpus C_X, and L_{Λ_X} for the language model acceptor representation of the
model Λ_X, which can be used in Equation 2 above.
3.1 Language Model using in-domain corpus
In order to evaluate the MATCH system, we collected a
corpus of multimodal utterances for the MATCH domain
in a laboratory setting from a set of sixteen first time
users (8 male, 8 female). We use this corpus to estab-
lish a point of reference to compare the models trained on
derived corpora against models trained on an in-domain
corpus. A total of 833 user interactions (218 multimodal
/ 491 speech-only / 124 pen-only) resulting from six sam-
ple task scenarios involving finding restaurants of various
types and getting their names, phones, addresses, or re-
views, and getting subway directions between locations
were collected and annotated. The data collected was
conversational speech where the users gestured and spoke
freely. We built a class-based trigram language model
(Λ_MATCH) using the 709 multimodal and speech-only utterances as the corpus (C_MATCH). The performance
of this model serves as the point of reference to compare
the performance of language models trained on derived
corpora.
3.2 Grammar as Language Model
The multimodal CFG (a fragment is presented in Sec-
tion 2) encodes the repertoire of language and ges-
ture commands allowed by the system and their com-
bined interpretations. The CFG can be approximated by
an FSM with arcs labeled with language, gesture and
meaning symbols, using well-known compilation tech-
niques (Nederhof, 1997). The resulting FSM can be pro-
jected on the language component and can be used as
the language model acceptor (L_Grammar) for speech recognition. Note that the
resulting language model acceptor is unweighted if the grammar is unweighted,
and it suffers from not being robust to language variations in users' input.
However, due to the tight coupling of the grammar used for recognition and
interpretation, every recognized
string can be assigned an interpretation (though it may
not necessarily be the intended interpretation).
3.3 Grammar-based N-gram Language Model
As mentioned earlier, a hand-crafted grammar typically
suffers from the problem of being too restrictive and in-
adequate to cover the variations and extra-grammaticality
of users' input. In contrast, an N-gram language model
derives its robustness by permitting all strings over an al-
phabet, albeit with different likelihoods. In an attempt
to provide robustness to the grammar-based model, we
created a corpus (C_Grammar) of N sentences by randomly sampling the set of
paths of the compiled grammar FSM and built a class-based N-gram language model
(Λ_Grammar) using this corpus. Although this corpus might not represent the true
distribution of sentences in the MATCH domain, we are able to derive some of the
benefits of N-gram language modeling techniques. This technique is similar to
that of Galescu et al. (1998).
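As an illustration of this sampling step, the sketch below randomly expands a toy
CFG whose rules paraphrase the speech side of Figure 3; the rule set and sample
size are assumptions for illustration, not the full MATCH grammar. The resulting
sentences would then be handed to any standard class-based N-gram trainer.

```python
import random

# Toy speech-side grammar (illustrative, paraphrasing Figure 3).
GRAMMAR = {
    "CMD":    [["TYPE", "for", "DEICNP"]],
    "TYPE":   [["phone"], ["review"]],
    "DEICNP": [["DDETPL", "NUM", "HEADPL"]],
    "DDETPL": [["these"], ["those"]],
    "NUM":    [["two"], ["three"], ["ten"]],
    "HEADPL": [["restaurants"]],
}

def sample_sentence(symbol="CMD"):
    """Expand a nonterminal by picking productions uniformly at random."""
    if symbol not in GRAMMAR:          # terminal word
        return [symbol]
    words = []
    for rhs_symbol in random.choice(GRAMMAR[symbol]):
        words.extend(sample_sentence(rhs_symbol))
    return words

corpus = [" ".join(sample_sentence()) for _ in range(10000)]
# e.g. "phone for these two restaurants"; feed `corpus` to an N-gram trainer.
```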
3.4 Combining Grammar and Corpus
A straightforward extension of the idea of sampling the
grammar in order to create a corpus is to select those
sentences out of the grammar which make the result-
ing corpus “similar” to the corpus collected in the pi-
lot studies. In order to create this corpus, we choose the N most likely
sentences as determined by a language model (Λ_MATCH) built using the collected
corpus. A mixture model (Λ_Mix) with mixture weight (λ) is built by interpolating
the model trained on the corpus of extracted sentences (Λ_Nbest) and the model
trained on the collected corpus (Λ_MATCH).

C_Nbest = {s_1, ..., s_N | s_i ∈ π(G)}   (3)

where s_1, ..., s_N are the N grammar sentences with the highest probability P_{Λ_MATCH}(s_i).

Λ_Mix = λ · Λ_Nbest + (1 − λ) · Λ_MATCH   (4)
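A schematic rendering of Equations 3 and 4, assuming placeholder functions
p_match and p_nbest for sentence probabilities under Λ_MATCH and Λ_Nbest (the
actual models are trained with standard N-gram tools):

```python
# Sketch only: `grammar_sentences` enumerates (or samples) the speech projection
# of the grammar, and `p_match` / `p_nbest` score a sentence under the in-domain
# and N-best-derived models. All of these are stand-ins, not real APIs.

def nbest_grammar_corpus(grammar_sentences, p_match, n):
    """Pick the N grammar sentences most likely under the in-domain model (Eq. 3)."""
    return sorted(grammar_sentences, key=p_match, reverse=True)[:n]

def mixture_prob(p_nbest, p_match, lam):
    """Interpolate the two models with mixture weight lambda (Eq. 4)."""
    return lambda sentence: lam * p_nbest(sentence) + (1.0 - lam) * p_match(sentence)
```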
3.5 Class-based Out-of-domain Language Model
An alternative to using in-domain corpora for building
language models is to “migrate” a corpus of a different
domain to the MATCH domain. The process of migrat-
ing a corpus involves suitably generalizing the corpus to
remove information specific to the out-of-domain application and instantiating
the generalized corpus to the MATCH
domain. Although there are a number of ways of gener-
alizing the out-of-domain corpus, the generalization we
have investigated involved identifying linguistic units,
such as noun and verb chunks in the out-of-domain cor-
pus and treating them as classes. These classes are then
instantiated to the corresponding linguistic units from the
MATCH domain. The identification of the linguistic units
in the out-of-domain corpus is done automatically using
a supertagger (Bangalore and Joshi, 1999). We use a cor-
pus collected in the context of a software helpdesk ap-
plication as an example out-of-domain corpus. In cases
where the out-of-domain corpus is closely related to the
domain at hand, a more semantically driven generaliza-
tion might be more suitable.
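The sketch below illustrates the generalize-then-instantiate idea on a single
helpdesk-style sentence; the chunking is hard-coded here (the paper uses a
supertagger), and the MATCH filler phrases are invented for illustration.

```python
import random

# In-domain phrases used to instantiate the generalized classes (illustrative).
MATCH_FILLERS = {
    "NP": ["cheap italian restaurants", "the subway", "thai places in chelsea"],
    "VP": ["show", "find", "list"],
}

def generalize(chunked_sentence):
    """Replace each (chunk_label, words) pair with its class label when known."""
    return [label if label in MATCH_FILLERS else words
            for label, words in chunked_sentence]

def instantiate(template, n=5):
    """Fill class labels with randomly chosen in-domain phrases."""
    sentences = []
    for _ in range(n):
        sentences.append(" ".join(
            random.choice(MATCH_FILLERS[tok]) if tok in MATCH_FILLERS else tok
            for tok in template))
    return sentences

# Out-of-domain (helpdesk-style) sentence, already chunked:
chunked = [("VP", "restart"), ("NP", "the mail server"), ("O", "please")]
template = generalize(chunked)          # ['VP', 'NP', 'please']
print(instantiate(template))            # e.g. "find cheap italian restaurants please"
```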
3.6 Adapting the SwitchBoard Language Model
We investigate the performance of a large vocabulary
conversational speech recognition system when applied
to a specific domain such as MATCH. We used the
Switchboard corpus (C_SWBD) as an example of a large vocabulary conversational
speech corpus. We built a trigram model (Λ_SWBD) using the 5.4 million word
corpus and investigated the effect of adapting the Switchboard language model
given N in-domain untranscribed speech utterances {O_1, ..., O_N}. The adaptation
is done by first recognizing the in-domain speech utterances and then building a
language model (Λ_Adapt) from the corpus of recognized text (C_Adapt). This
bootstrapping mechanism can be used to derive a domain-specific corpus and
language model without any transcriptions. Similar techniques for unsupervised
language model adaptation are presented in (Bacchiani and Roark, 2003;
Souvignier and Kellner, 1998).

C_Adapt = {w_1, w_2, ..., w_N}   (5)

where w_i = argmax_w (O_i ∘ A ∘ P ∘ L_{Λ_SWBD})(w)
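Schematically, the adaptation loop of Equation 5 can be written as follows, where
recognize and train_ngram are hypothetical stand-ins for the recognizer run with
the SwitchBoard model and for an N-gram trainer; neither is a real API from the
paper or a specific toolkit.

```python
def adapt_lm(untranscribed_utterances, lm_switchboard, recognize, train_ngram):
    """Recognize in-domain audio with the out-of-domain model, then retrain on the output."""
    recognized_corpus = [recognize(o, lm_switchboard) for o in untranscribed_utterances]
    return train_ngram(recognized_corpus)
```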
3.7 Adapting a wide-coverage grammar
There have been a number of computational implemen-
tations of wide-coverage, domain-independent, syntac-
tic grammars for English in various formalisms (XTAG,
2001; Clark and Hockenmaier, 2002; Flickinger et al.,
2000). Here, we describe a method that exploits one
such grammar implementation in the Lexicalized Tree-
Adjoining Grammar (LTAG) formalism, for deriving
domain-specific corpora. An LTAG consists of a set of
elementary trees (Supertags) (Bangalore and Joshi, 1999)
each associated with a lexical item. The set of sentences
generated by an LTAG can be obtained by combining su-
pertags using substitution and adjunction operations. In
related work (Rambow et al., 2002), it has been shown
that for a restricted version of LTAG, the combinations
of a set of supertags can be represented as an FSM. This
FSM compactly encodes the set of sentences generated
by an LTAG grammar.
We derive a domain-specific corpus by constructing
a lexicon consisting of pairings of words with their su-
pertags that are relevant to that domain. We then com-
pile the grammar to build an FSM of all sentences up to a
given length. We sample this FSM and build a language
model as discussed in Section 3.3. Given untranscribed
utterances from a specific domain, we can also adapt the
language model as discussed in Section 3.6.
4 Robust Multimodal Understanding
The grammar-based interpreter uses a composition operation on FSTs to transduce
multimodal strings (gesture, speech) to an interpretation. The set of speech strings
that can be assigned an interpretation is exactly the set of strings represented
in the grammar. It is to be expected
that the accuracy of meaning representation will be rea-
sonable, if the user’s input matches one of the multimodal
strings encoded in the grammar. But for those user inputs
that are not encoded in the grammar, the system will not
return a meaning representation. In order to improve the
usability of the system, we expect it to produce a (partial)
meaning representation, irrespective of the grammatical-
ity of the user’s input and the coverage limitations of the
grammar. It is this aspect that we refer to as robustness in
understanding. We present below two approaches to ro-
bust multimodal understanding that we have developed.
4.1 Pattern Matching Approach
In order to overcome the possible mismatch between
the user’s input and the language encoded in the multi-
modal grammar (Λ_G), we use an edit-distance based pattern matching algorithm to
coerce the set of strings (S) encoded in the lattice resulting from ASR (Λ_ASR)
to match one of the strings that can be assigned an interpretation. The edit
operations (insertion, deletion, substitution) can either be word-based or
phone-based and are associated with a cost. These costs can be tuned based on
the word/phone confusions present in the domain. The edit operations are encoded
as a transducer (Λ_Edit) as shown in Figure 5 and can apply to both one-best and
lattice output of the recognizer. We are interested in the string with the least
number of edits that can be assigned an interpretation by the grammar. This can
be achieved by composition (∘) of transducers followed by a search for the least
cost path through a weighted transducer as shown below.

s* = argmin_{s ∈ S} (Λ_ASR ∘ Λ_Edit ∘ Λ_G)(s)   (6)
[Figure 5 depicts a single-state edit transducer with arcs w_i:w_i/0 (identity),
w_i:w_j/scost (substitution), w_i:ε/dcost (deletion), and ε:w_i/icost (insertion).]

Figure 5: Edit transducer with insertion, deletion, substitution and identity
arcs. w_i and w_j could be words or phones. The costs on the arcs are set up
such that scost < icost + dcost.
This approach is akin to example-based techniques
used in other areas of NLP such as machine translation.
In our case, the set of examples (encoded by the gram-
mar) is represented as a finite-state machine.
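The following simplified, word-level analogue of Equation 6 conveys the idea
without FSTs: the one-best ASR string is matched against an explicit list of
in-grammar strings, and the string reachable with the least total edit cost is
kept. The grammar list and edit costs below are illustrative.

```python
def edit_cost(asr_words, grammar_words, sub=1.0, ins=1.0, dele=1.0):
    """Weighted Levenshtein distance over words (dynamic programming)."""
    m, n = len(asr_words), len(grammar_words)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * dele
    for j in range(1, n + 1):
        d[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = asr_words[i - 1] == grammar_words[j - 1]
            d[i][j] = min(d[i - 1][j] + dele,
                          d[i][j - 1] + ins,
                          d[i - 1][j - 1] + (0.0 if same else sub))
    return d[m][n]

def closest_in_grammar(asr_string, grammar_strings):
    """Return the in-grammar string with the least total edit cost."""
    return min(grammar_strings, key=lambda g: edit_cost(asr_string.split(), g.split()))

grammar = ["phone for these two restaurants", "review for these three restaurants"]
print(closest_in_grammar("phone these two restaurants", grammar))
# phone for these two restaurants
```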
4.2 Classification-based Approach
A second approach is to view robust multimodal under-
standing as a sequence of classification problems in or-
der to determine the predicate and arguments of an ut-
terance. The meaning representation shown in (1) consists of a predicate (the
command attribute) and a sequence of one or more argument attributes which are
the parameters for the successful interpretation of the user's intent. For
example, in (1), cmd:info is the predicate and type:phone and object:[r12,r15]
are the arguments to the predicate.
We determine the predicate (p*) for an n-token multimodal utterance (w_1^n) by
maximizing the posterior probability as shown in Equation 7.

p* = argmax_p P(p | w_1^n)   (7)
We view the problem of identifying and extracting ar-
guments from a multimodal input as a problem of asso-
ciating each token of the input with a specific tag that
encodes the label of the argument and the span of the ar-
gument. These tags are drawn from a tagset which is con-
structed by extending each argument label by three addi-
tional symbols I, O, and B, following (Ramshaw and Marcus, 1995). These symbols
correspond to cases when a token is inside (I) an argument span, outside (O) an
argument span, or at the boundary of two argument spans (B) (see Table 1).
Given this encoding, the problem of extracting the ar-
guments is a search for the most likely sequence of tags
(T*) given the input multimodal utterance w_1^n as shown in Equation (8). We
approximate the posterior probability P(T | w_1^n) using independence assumptions as
User Utterance:       cheap thai upper west side
Argument Annotation:  <price> cheap </price> <cuisine> thai </cuisine> <place> upper west side </place>
IOB Encoding:         cheap price<B>  thai cuisine<B>  upper place<I>  west place<I>  side place<I>

Table 1: The {I,O,B} encoding for argument extraction.
shown in Equation (9).
T* = argmax_T P(T | w_1^n)   (8)

   ≈ argmax_T ∏_i P(t_i | w_{i−2}, ..., w_{i+2}, t_{i−1}, t_{i−2})   (9)
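For concreteness, the sketch below decodes a predicted tag sequence in the
encoding of Table 1 back into argument attribute-value pairs; the decoding rules
are our reading of the scheme, not the system's exact post-processing.

```python
def decode_arguments(tokens, tags):
    """Group tokens into (argument_label, phrase) pairs from I/O/B-style tags."""
    spans = []                      # completed (label, phrase) pairs
    cur_label, cur_words = None, []
    for token, tag in zip(tokens, tags):
        if tag == "O":
            if cur_label is not None:
                spans.append((cur_label, " ".join(cur_words)))
            cur_label, cur_words = None, []
            continue
        label, marker = tag[:-1].split("<")      # "place<I>" -> ("place", "I")
        # B always opens a new span; I opens one only when the label changes.
        if marker == "B" or label != cur_label:
            if cur_label is not None:
                spans.append((cur_label, " ".join(cur_words)))
            cur_label, cur_words = label, []
        cur_words.append(token)
    if cur_label is not None:
        spans.append((cur_label, " ".join(cur_words)))
    return spans

tokens = ["cheap", "thai", "upper", "west", "side"]
tags = ["price<B>", "cuisine<B>", "place<I>", "place<I>", "place<I>"]
print(decode_arguments(tokens, tags))
# [('price', 'cheap'), ('cuisine', 'thai'), ('place', 'upper west side')]
```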
Owing to the large set of features that are used for
predicate identification and argument extraction, we es-
timate the probabilities using a classification model. In
particular, we use the Adaboost classifier (Freund and
Schapire, 1996) wherein a highly accurate classifier is built by combining many
"weak" or "simple" base classifiers h_i, each of which may only be moderately
accurate. The selection of the weak classifiers proceeds iteratively, at each
step picking the weak classifier that correctly classifies the examples
misclassified by the previously selected weak classifiers. Each weak classifier
is associated with a weight (α_i) that reflects its contribution towards
minimizing the classification error. The posterior probability P(p | x) is
computed as in Equation 10.
P(p | x) = 1 / (1 + e^{−2 Σ_i α_i h_i(x)})   (10)
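A toy illustration of Equation 10, with made-up n-gram indicator features
standing in for the weak classifiers:

```python
import math

def predicate_posterior(weak_classifiers, x):
    """weak_classifiers: list of (alpha_i, h_i) with h_i(x) in {+1, -1} (Eq. 10)."""
    score = sum(alpha * h(x) for alpha, h in weak_classifiers)
    return 1.0 / (1.0 + math.exp(-2.0 * score))

# Hypothetical n-gram indicator features for the predicate "inforequest".
weak = [(0.8, lambda x: 1 if "phone" in x else -1),
        (0.3, lambda x: 1 if "for these" in x else -1)]
print(predicate_posterior(weak, "phone for these two restaurants"))
```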
It should be noted that the data for training the clas-
sifiers can be collected from the domain or derived from
an in-domain grammar using techniques similar to those
presented in Section 3.
5 Experiments and Results
We describe a set of experiments to evaluate the perfor-
mance of the speech recognizer and the concept accu-
racy of speech only and speech and gesture exchanges in
our MATCH multimodal system. We use word accuracy
and string accuracy for evaluating ASR output. All re-
sults presented in this section are based on 10-fold cross-
validation experiments run on the 709 spoken and multi-
modal exchanges collected from the pilot study described
in Section 3.1.
5.1 Language Model
Table 2 presents the performance results for ASR word
and sentence accuracy using language models trained on
collected in-domain corpus as well as on corpora derived
using the different methods discussed in Section 3. For
the class-based models mentioned in the table, we defined
different classes based on areas of interest (e.g., riverside park, turtle
pond), points of interest (e.g., Ellis Island, United Nations Building), type of
cuisine (e.g., Afghani, Indonesian), price categories (e.g., moderately priced,
expensive), and neighborhoods (e.g., Upper East Side, Chinatown).

Scenario                   Model                                                 ASR Word Accuracy (%)  Sentence Accuracy (%)
Grammar Based              Grammar as Language Model                             41.6                   38.0
                           Class-based N-gram Language Model                     60.6                   42.9
In-domain Data             Class-based N-gram Model                              73.8                   57.1
Grammar+In-domain Data     Class-based N-gram Model                              75.0                   59.5
Out-of-domain              N-gram Model                                          17.6                   17.5
                           Class-based N-gram Model                              58.4                   38.8
                           Class-based N-gram Model with
                           Grammar-based N-gram Language Model                   64.0                   45.4
SwitchBoard                N-gram Model                                          43.5                   25.0
                           Language model trained on recognized in-domain data   55.7                   36.3
Wide-coverage Grammar      N-gram Model                                          43.7                   24.8
                           Language model trained on recognized in-domain data   55.8                   36.2

Table 2: Performance results for ASR word and sentence accuracy using models
trained on data derived from different methods of bootstrapping domain-specific data.
It is immediately apparent that the hand-crafted gram-
mar as language model performs poorly and a language
model trained on the collected domain-specific corpus
performs significantly better than models trained on de-
rived data. However, it is encouraging to note that a
model trained on a derived corpus (obtained from com-
bining the migrated out-of-domain corpus and a corpus created by sampling the
in-domain grammar) is within 10% word accuracy of the model trained on the
collected corpus. There are several other noteworthy ob-
servations from these experiments.
The performance of the language model trained on data
sampled from the grammar is dramatically better as com-
pared to the performance of the hand-crafted grammar.
This technique provides a promising direction for author-
ing portable grammars that can be sampled subsequently
to build robust language models when no in-domain cor-
pora are available. Furthermore, combining grammar and
in-domain data as described in Section 3.4 outperforms
all other models significantly.
For the experiment on migration of out-of-domain cor-
pus, we used a corpus from a software helpdesk appli-
cation. Table 2 shows that the migration of data using
linguistic units as described in Section 3.5 significantly
outperforms a model trained only on the out-of-domain
corpus. Also, combining the grammar sampled corpus
with the migrated corpus provides a further improvement.
The performance of the SwitchBoard model on the
MATCH domain is presented in Table 2. We built a tri-
gram model using a 5.4 million word SwitchBoard cor-
pus and investigated the effect of adapting the resulting
language model on in-domain untranscribed speech ut-
terances. The adaptation is done by first recognizing the
training partition of the in-domain speech utterances and
then building a language model from the recognized text.
We observe that although the performance of the Switch-
Board language model on the MATCH domain is poorer
than the performance of a model obtained by migrating
data from a related domain, the performance can be sig-
nificantly improved using the adaptation technique.
The last row of Table 2 shows the results of using
the MATCH specific lexicon to generate a corpus us-
ing a wide-coverage grammar, training a language model
and adapting the resulting model using in-domain untran-
scribed speech utterances as was done for the Switch-
Board model. The class-based trigram model was built
using 500,000 randomly sampled paths from the network
constructed by the procedure described in Section 3.7.
5.2 Multimodal Understanding
In this section, we present results on multimodal under-
standing using the two techniques presented in Section 4.
We use concept token accuracy and concept string accu-
racy as evaluation metrics for the entire meaning repre-
sentation in these experiments. These metrics correspond
to the word accuracy and string accuracy metrics used for
ASR evaluation. In order to provide a finer-grained eval-
uation, we breakdown the concept accuracy in terms of
the accuracy of identifying the predicates and arguments.
Again, we use string accuracy metrics to evaluate pred-
icate and argument accuracy. We use the output of the
ASR with the language model trained on the collected
data (word accuracy of 73.8%) as the input to the under-
standing component.
The grammar-based multimodal understanding system
composes the input multimodal string with the multi-
modal grammar represented as an FST to produce an in-
terpretation. Thus an interpretation can be assigned to
only those multimodal strings that are encoded in the
grammar. However, the result of ASR and gesture recog-
nition may not be one of the strings encoded in the gram-
mar, and such strings are not assigned an interpretation.
This fact is reflected in the low concept string accuracy
for the baseline as shown in Table 3.

                                Predicate String  Argument String  Concept Token  Concept String
                                Accuracy (%)      Accuracy (%)     Accuracy (%)   Accuracy (%)
Baseline                        65.2              52.1             53.5           45.2
Word-based Pattern-Matching     73.7              62.4             68.1           59.0
Phone-based Pattern-Matching    73.7              63.8             67.8           61.3
Classification-based            84.1              59.1             73.5           56.4

Table 3: Performance results of robust multimodal understanding
The pattern-matching based robust understanding ap-
proach mediates the mismatch between the strings that
are output by ASR and the strings that can be assigned an
interpretation. We experimented with word based pattern
matching as well as phone based pattern matching on the
one-best output of the recognizer. As shown in Table 3,
the pattern-matching robust understanding approach im-
proves the concept accuracy over the baseline signifi-
cantly. Furthermore, the phone-based matching method
has a similar performance to the word-based matching
method.
For the classification-based approach to robust under-
standing we used a total of 10 predicates such as help, as-
sert, inforequest, and 20 argument types such as cuisine,
price, and location. We use unigrams, bigrams and trigrams
appearing in the multimodal utterance as weak classifiers
for the purpose of predicate classification. In order to
predict the tag of a word for argument extraction, we use
the left and right trigram context and the tags for the pre-
ceding two tokens as weak classifiers. The results are
presented in Table 3.
Both the approaches to robust understanding outper-
form the baseline model significantly. However, it is in-
teresting to note that while the pattern-matching based
approach has a better argument extraction accuracy, the
classification based approach has a better predicate iden-
tification accuracy. Two possible reasons for this are:
first, argument extraction requires more non-local information, which is
available to the pattern-matching based approach, while the classification-based
approach relies on local information and is better suited to identifying the
simple predicates in MATCH. Second, the pattern-
matching approach uses the entire grammar as a model
for matching while the classification approach is trained
on the training data, which is significantly smaller than the set of examples
encoded in the grammar.
6 Discussion
Although we are not aware of any attempts to address
the issue of robust understanding in the context of multi-
modal systems, this issue has been of great interest in the
context of speech-only conversational systems (Dowd-
ing et al., 1993; Seneff, 1992; Allen et al., 2000; Lavie,
1996). The output of the recognizer in these systems is usually parsed using a
handcrafted grammar that assigns
a meaning representation suited for the downstream dia-
log component. The coverage problems of the grammar
and parsing of extra-grammatical utterances is typically
addressed by retrieving fragments from the parse chart
and incorporating operations that combine fragments to
derive a meaning of the recognized utterance. We have
presented an approach that achieves robust multimodal
utterance understanding using the edit-distance automa-
ton in a finite-state-based interpreter without the need for
combining fragments from a parser.
The issue of combining rule-based and data-driven ap-
proaches has received less attention, with the exception
of a few (Wang et al., 2000; Rayner and Hockey, 2003;
Wang and Acero, 2003). In a recent paper (Rayner and
Hockey, 2003), the authors address this issue by em-
ploying a decision-list-based speech understanding sys-
tem as a means of progressing from rule-based models
to data-driven models when data becomes available. The
decision-list-based understanding system also provides a
method for robust understanding. In contrast, the ap-
proach presented in this paper can be used on lattices of
speech and gestures to produce a lattice of meaning rep-
resentations.
7 Conclusion
In this paper, we have addressed how to rapidly proto-
type multimodal conversational systems without relying
on the collection of domain-specific corpora. We have
presented several techniques that exploit domain-specific
grammars, reuse out-of-domain corpora and adapt large
conversational corpora and wide-coverage grammars to
derive a domain-specific corpus. We have demonstrated
that a language model trained on a derived corpus per-
forms within 10% word accuracy of a language model
trained on a collected domain-specific corpus, suggest-
ing a method of building an initial language model
without having to collect domain-specific corpora. We
have also presented and evaluated pattern-matching and
classification-based approaches to improve the robust-
ness of multimodal understanding. We have presented re-
sults for these approaches in the context of a multimodal
city guide application (MATCH).
8 Acknowledgments
We thank Patrick Ehlen, Amanda Stent, Helen Hastie,
Candy Kamm, Marilyn Walker, and Steve Whittaker for
their contributions to the MATCH system. We also thank
Allen Gorin, Mazin Rahim, Giuseppe Riccardi, and Juer-
gen Schroeter for their comments on earlier versions of
this paper.
References
J. Allen, D. Byron, M. Dzikovska, G. Ferguson,
L. Galescu, and A. Stent. 2000. An architecture for
a generic dialogue shell. JNLE, 6(3).
M. Bacchiani and B. Roark. 2003. Unsupervised lan-
guage model adaptation. In Proc. Int. Conf. Acoustics, Speech, and Signal Processing.
S. Bangalore and M. Johnston. 2000. Tight-coupling of
multimodal language processing with speech recogni-
tion. In Proceedings of ICSLP, Beijing, China.
S. Bangalore and A. K. Joshi. 1999. Supertagging: An
approach to almost parsing. Computational Linguis-
tics, 25(2).
M. Beutnagel, A. Conkie, J. Schroeter, Y. Stylianou, and
A. Syrdal. 1999. The AT&T next-generation TTS. In Joint Meeting of ASA, EAA, and DAGA.
M. Boros, W. Eckert, F. Gallwitz, G. Görz, G. Hanrieder,
and H. Niemann. 1996. Towards Understanding Spon-
taneous Speech: Word Accuracy vs. Concept Accu-
racy. In Proceedings of ICSLP, Philadelphia.
Stephen Clark and Julia Hockenmaier. 2002. Evaluating
a wide-coverage CCG parser. In Proceedings of the
LREC 2002 Beyond Parseval Workshop, Las Palmas,
Spain.
J. Dowding, J. M. Gawron, D. E. Appelt, J. Bear,
L. Cherny, R. Moore, and D. B. Moran. 1993. GEM-
INI: A natural language system for spoken-language
understanding. In Proceedings of ACL, pages 54–61.
D. Flickinger, A. Copestake, and I. Sag. 2000. HPSG analysis of English. In W. Wahlster, editor, Verbmobil:
Foundations of Speech-to-Speech Translation, pages
254–263. Springer–Verlag, Berlin, Heidelberg, New
York.
Y. Freund and R. E. Schapire. 1996. Experiments with
a new boosting algorithm. In Machine Learning: Pro-
ceedings of the Thirteenth International Conference,
pages 148–156.
L. Galescu, E. K. Ringger, and J. F. Allen. 1998. Rapid
language model development for new task domains. In
Proceedings of the ELRA First International Confer-
ence on Language Resources and Evaluation (LREC),
Granada, Spain.
M. Johnston and S. Bangalore. 2000. Finite-state mul-
timodal parsing and understanding. In Proceedings of
COLING, Saarbrücken, Germany.
M. Johnston, S. Bangalore, A. Stent, G. Vasireddy, and
P. Ehlen. 2002a. Multimodal language processing for
mobile information access. In Proceedings of ICSLP, Denver, CO.
M. Johnston, S. Bangalore, G. Vasireddy, A. Stent,
P. Ehlen, M. Walker, S. Whittaker, and P. Maloor.
2002b. MATCH: An architecture for multimodal di-
alog systems. In Proceedings of ACL, Philadelphia.
A. Lavie. 1996. GLR*: A Robust Grammar-Focused
Parser for Spontaneously Spoken Language. Ph.D.
thesis, Carnegie Mellon University.
M-J. Nederhof. 1997. Regular approximations of CFLs:
A grammatical view. In Proceedings of the Interna-
tional Workshop on Parsing Technology, Boston.
Fernando C.N. Pereira and Michael D. Riley. 1997.
Speech recognition by composition of weighted finite
automata. In E. Roche and Y. Schabes, editors, Finite
State Devices for Natural Language Processing, pages
431–456. MIT Press, Cambridge, Massachusetts.
Owen Rambow, Srinivas Bangalore, Tahir Butt, Alexis
Nasr, and Richard Sproat. 2002. Creating a finite-
state parser with application semantics. In Proceed-
ings of the 19th International Conference on Compu-
tational Linguistics (COLING 2002), Taipei, Taiwan.
Lance Ramshaw and Mitchell P. Marcus. 1995. Text
chunking using transformation-based learning. In Pro-
ceedings of the Third Workshop on Very Large Cor-
pora, MIT, Cambridge, Boston.
M. Rayner and B. A. Hockey. 2003. Transparent com-
bination of rule-based and data-driven approaches in
speech understanding. In Proceedings of EACL 2003.
S. Seneff. 1992. A relaxation method for understand-
ing spontaneous speech utterances. In Proceedings,
Speech and Natural Language Workshop, San Mateo,
CA.
R.D. Sharp, E. Bocchieri, C. Castillo, S. Parthasarathy,
C. Rath, M. Riley, and J. Rowland. 1997. The Watson speech recognition engine. In Proceedings of
ICASSP, pages 4065–4068.
B. Souvignier and A. Kellner. 1998. Online adaptation
for language models in spoken dialogue systems. In
Int. Conference on Spoken Language Processing (IC-
SLP).
Y. Wang and A. Acero. 2003. Combination of CFG and N-gram modeling in semantic
grammar learning. In Proceedings of the Eurospeech Conference, Geneva,
Switzerland.
Y.Y. Wang, M. Mahajan, and X. Huang. 2000. Unified
Context-Free Grammar and N-Gram Model for Spo-
ken Language Processing. In Proceedings of ICASSP.
XTAG. 2001. A lexicalized tree-adjoining grammar for
English. Technical report, University of Pennsylvania,
http://www.cis.upenn.edu/~xtag/gramrelease.html.
