Automatic Construction of Frame Representations 
for Spontaneous Speech in Unrestricted Domains 
Klaus Zechner 
Language Technologies Institute 
Carnegie Mellon University 
5000 Forbes Avenue 
Pittsburgh, PA 15213, USA 
zechner@cs.cmu.edu
Abstract 
This paper presents a system which automatically 
generates shallow semantic frame structures for con- 
versational speech in unrestricted domains. 
We argue that such shallow semantic representations 
can indeed be generated with a minimum amount of 
linguistic knowledge engineering and without having 
to explicitly construct a semantic knowledge base. 
The system is designed to be robust against the
problems of speech dysfluencies, ungrammaticalities,
and imperfect speech recognition.
Initial results on speech transcripts are promising:
a correct mapping was identified in 21% of the
clauses of a test set, and in 44% of the clauses once
ungrammatical or verb-less clauses were removed
from this test set.
1 Introduction 
In syntactic and semantic analysis of spontaneous 
speech, little research has been done with regard 
to dealing with language in unrestricted domains. 
There are several reasons why so far an in-depth 
analysis of this type of language data has been con- 
sidered prohibitively hard: 
• inherent properties of spontaneous speech, such
as dysfluencies and ungrammaticalities (Lavie,
1996)
• word accuracy being far from perfect (e.g., on a 
typical corpus such as SWITCHBOARD (SWBD) 
(Godfrey et al., 1992), current state-of-the-art 
recognizers have word error rates in the range 
of 30-40% (Finke et al., 1997)) 
• if the domain is unrestricted, manual construc- 
tion of a semantic knowledge base with reason- 
able coverage is very labor intensive 
In this paper we propose to combine methods of 
partial parsing ("chunking") with the mapping of 
the verb arguments onto subcategorization frames 
that can be extracted automatically, in this case, 
from WordNet (Miller et al., 1993). As prelimi- 
nary results indicate, this yields a way of generating 
shallow semantic representations efficiently and with 
minimal manual effort. 
Eventually, these semantic structures can serve as 
(additional) input to a variety of different tasks in 
NLP, such as text or dialogue summarization, in- 
formation gisting, information retrieval, or shallow 
machine translation. 
2 Shallow Semantic Structures 
The two main representations we are building on are 
the following: 
• chunks: these correspond mostly to basic (i.e., 
non-attached) phrasal constituents 
• frames: these are built from the parsed chunks 
according to subcategorization constraints ex- 
tracted from the WordNet lexicon 
The chunks are defined in a similar way as in (Ab- 
ney, 1996), namely as "non-recursive phrasal units"; 
they roughly correspond to the standard linguistic 
notion of constituents, except that there are no at- 
tachments made (e.g., a PP to a NP) and that a ver- 
bal chunk does not include any of its arguments but 
just consists of the verbal complex (auxiliary/main 
verb), including possibly inserted adverbs and/or 
negation particles. 
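The chunk definition above can be illustrated with a minimal sketch. It is not the chart parser or POS grammar used in the system; the tag sets, the NEG tag, and the function names are assumptions made for illustration. It groups POS-tagged words into flat chunks, makes no attachments, and keeps adverbs or negation particles that occur inside the verbal complex within the verbal chunk.

```python
# Hypothetical tag classes; the actual POS grammar is not given in the paper.
NP_TAGS = {"PRP", "PRPA", "DT", "JJ", "NN"}
VB_TAGS = {"AUX", "VB"}
VB_INNER = VB_TAGS | {"RB", "NEG"}  # may occur inside a verbal complex


def parse_tagged(text):
    """Turn 'i/PRP will/AUX' into [('i', 'PRP'), ('will', 'AUX')]."""
    return [tuple(token.rsplit("/", 1)) for token in text.split()]


def chunk(tagged):
    """Group (word, tag) pairs into flat, non-recursive (label, words) chunks."""
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        tag = tagged[i][1]
        if tag in VB_TAGS:
            # Extend over verbs, inserted adverbs, and negation, but end the
            # chunk at the last verb or negation so trailing adverbs stay out.
            j, last_core = i, i
            while j < n and tagged[j][1] in VB_INNER:
                if tagged[j][1] in VB_TAGS or tagged[j][1] == "NEG":
                    last_core = j
                j += 1
            span = tagged[i:last_core + 1]
            label = "vbneg" if any(t == "NEG" for _, t in span) else "vb"
        else:
            j = i + 1
            if tag in NP_TAGS or tag == "PREP":
                # A pp is a preposition plus the following np material;
                # it is never attached to a preceding np.
                while j < n and tagged[j][1] in NP_TAGS:
                    j += 1
                label = "pp" if tag == "PREP" else "np"
            elif tag == "RB":
                while j < n and tagged[j][1] == "RB":
                    j += 1
                label = "advp"
            else:
                label = "conj" if tag == "CC" else "other"
            span = tagged[i:j]
        chunks.append((label, [w for w, _ in span]))
        i += len(span)
    return chunks
```

For example, `chunk(parse_tagged("i/PRP will/AUX talk/VB to/PREP you/PRPA again/RB"))` yields the label sequence np, vb, pp, advp.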
All frames are being generated on the basis of 
"short clauses" which we define as minimal clausal 
units that contain at least one subject and an
inflected verbal form.1, 2
To produce the list of all possible subcategoriza- 
tion frames, we first extracted all verbal tokens from 
the tagged SWITCHBOARD corpus and then retrieved 
the frames from WordNet. Table 1 provides a sum- 
mary of this pre-calculation. 
1This means in effect that relative clauses will get mapped 
separately. They will, however, have to be "linked" to the 
phrase they modify. 
2We are also considering even shorter units as the basis
for the mapping; these would, e.g., include non-inflected
clausal complements. The most convenient solution has yet
to be determined.
Verbal tokens              4428
Different lemmata          2464
Senses in all lemmata      8523
Avg. senses per lemma      3.46
Total number of frames    15467
Avg. frames per sense      1.81
Table 1: WordNet: verbal lemmata, senses,
and frames
3 Resources and System 
Components 
We use the following resources to build our system: 
• the SWITCHBOARD (SWBD) corpus (Godfrey et 
al., 1992) for speech data, transcripts, and 
annotations at various levels (e.g., for segment 
boundaries or parts of speech) 
• the JANUS speech recognizer (Waibel et al., 
1996) to provide us with input hypotheses 
• a part of speech (POS) tagger, derived from 
(Brill, 1994), adapted to and retrained for the 
SWITCHBOARD corpus 
• a preprocessing pipe which cleans up speech
dysfluencies (e.g., repetitions, hesitations) and
contains a segmentation module to split the
speech recognizer turns into short clauses
• a chart parser (Ward, 1991) with a POS based 
grammar to generate the chunks 3 (phrasal con- 
stituents) 
• WordNet 1.5 (Miller et al., 1993) for the extraction
of subcategorization (subcat) frames for all
senses of a verb (including semantic features,
such as "animacy")
• a mapper which tries to find the "best match" 
between the chunks found within a short clause 
and the subcat frames for the main verb in that 
clause 
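As an illustration of the dysfluency cleanup mentioned in the list above, the following is a minimal sketch that drops hesitation fillers and collapses immediate word repetitions. The filler inventory and the function name are assumptions; the actual preprocessing pipe is not specified in this paper.

```python
# Hypothetical filler inventory; the real preprocessing pipe is not described here.
FILLERS = {"uh", "um", "er"}


def clean_dysfluencies(words):
    """Remove hesitation fillers and immediate repetitions of the same word."""
    cleaned = []
    for word in words:
        lowered = word.lower()
        if lowered in FILLERS:
            continue  # hesitation: drop it
        if cleaned and cleaned[-1].lower() == lowered:
            continue  # repetition of the previous kept word: drop it
        cleaned.append(word)
    return cleaned
```

A rule this simple also collapses legitimate repetitions (e.g., "very very"), so a real module would need additional cues.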
The major blocks of the system architecture are
depicted in Figure 1.
We want to stress here that except for the devel- 
opment of the small POS grammar and the frame- 
mapper, the other components and resources were 
already present or quite simple to implement. There 
has also been significant work on (semi-)automatic 
induction of subcategorization frames (Manning, 
1993; Briscoe and Carroll, 1997), such that even 
3More details about the chunk parser can be found in 
(Zechner, 1997). 
[Figure: input utterance -> speech recognizer -> hypothesis ->
POS tagger -> preprocessing pipe -> chunk parser ->
chunk sequence -> frame mapper -> frame representation]
Figure 1: Global system architecture 
without the important knowledge source from WordNet,
a similar system could be built for other languages
as well. Also, the EuroWordNet project
(Vossen et al., 1997) is currently building
WordNet resources for other European languages.
4 Preliminary Experiments 
We performed some initial experiments using the 
SWBD transcripts as input to the system. These 
were POS tagged, preprocessed, segmented into 
short clauses, parsed into chunks using a POS
based grammar, and finally, for each short clause, 
the frame-mapper matched all potential arguments 
of the verb against all possible subcategorization 
frames listed in the lemmata file we had precom- 
puted from WordNet (see section 2). 
In total we had over 600000 short clauses, con- 
taining approximately 1.7 million chunks. Only 18 
different chunk patterns accounted for about half 
of these short clauses. Table 2 shows these chunk 
main verb   frequency   chunk sequence
present?
no              83353   (noises/hesit.)
no              36731   aff
no              33182   conj
yes             29749   np vb
yes             19176   np vb np
no              13834   np
no              13623   conj np
yes             12220   conj np vb
yes             11038   conj np vb np
yes              7649   np vb adjp
yes              7092   np vb pp
yes              5552   np vbneg
no               5044   advp
yes              4926   np vb np pp
no               4079   pp
yes              3999   conj np vb pp
yes              3998   conj np vb adjp
yes              3996   np vb advp
Table 2: Most frequent chunk sequences in 
short clauses 
patterns and their frequencies.4 Most of these contain
main verbs and hence can sensibly be used
in a mapping procedure, but some of them (e.g.,
aff, conj, advp) do not. These are typically
back-channellings, adverbial comments, and colloquial
forms (e.g., "yeah", "and...", "oh really"). They can
easily be dealt with by a preprocessing module that
assigns them to one of these categories and does not
send them to the mapper.
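A filter of this kind could be sketched as follows. The category names and the function are hypothetical; only the criterion (a clause needs a verbal chunk to be sent to the mapper) is taken from the text.

```python
VERBAL = {"vb", "vbneg"}


def classify_clause(labels):
    """Decide whether a chunk label sequence should go to the mapper.

    Returns 'mappable' if the clause has a verbal chunk; otherwise a coarse
    (hypothetical) category that is handled outside the mapper.
    """
    if any(label in VERBAL for label in labels):
        return "mappable"
    if not labels:
        return "noise"
    if labels == ["aff"]:
        return "back-channel"
    if labels == ["advp"]:
        return "adverbial comment"
    return "other verb-less"
```

For instance, `classify_clause(["conj", "np", "vb", "pp"])` gives "mappable", while `classify_clause(["aff"])` gives "back-channel".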
Another interesting observation we make here is 
that within these most common chunk patterns, 
there is only one pattern (np vb np pp) which could 
lead to a potential PP-attachment ambiguity. We 
conjecture that this is most probably due to the na- 
ture of conversational speech which, unlike for writ- 
ten (and more formal) language, does not make too 
frequent use of complex noun phrases that have one 
or multiple prepositional phrases attached to them. 
We selected 98 short clauses randomly from the 
output to perform a first error analysis. 
The results are summarized in Table 3. In over 
21% of the clauses, the mapper finds at least one 
mapping that is correct. Another 23.5% of the 
clauses do not contain any chunks worth mapping
in the first place (noises, hesitations),
4Chunk abbreviations: conj=conjunction, aff=affirmative,
np=noun phrase, vb=verbal chunk, vbneg=negated verbal
chunk, adjp=adjectival phrase, advp=adverbial phrase,
pp=prepositional phrase.
so these could be filtered out and handled entirely
before the mapping process takes place, as we
mentioned earlier. 28.6% of the clauses are in some sense
incomplete; most of them lack a main verb,
which is the crucial element for starting the mapping
procedure. We regard these as "hard" residues,
including well-known linguistic problems such as el- 
lipsis, in addition to some spoken language ungram- 
maticalities. The last two categories (26.6% com- 
bined) in the table are due to the incompleteness and 
inaccuracies of the system components themselves. 
To illustrate the process of mapping, we shall 
present an example here, starting from the 
POS-tagged utterance up to the semantic frame 
representation:5, 6
short clause, annotated with POS:
i/PRP will/AUX talk/VB
to/PREP you/PRPA again/RB
LEMMA/token (of main verb):
talk/talk
parsed chunks:
-np-vb-pp-advp
parsed sequence to map:
-NP-VBZ-PP
WordNet frames:
:1-INAN-VBZ:1-ANIM-VBZ:1-INAN-IS-VBG-PP
:1-ANIM-VBZ-PP:1-ANIM-VBZ-TO-ANIM
:2-ANIM-VBZ:2-ANIM-VBZ-PP
:3-ANIM-VBZ:3-ANIM-VBZ-INAN
:4-ANIM-VBZ
:5-ANIM-VBZ
:6-ANIM-VBZ:6-INAN-VBZ-TO-ANIM
:6-ANIM-VBZ-ON-INAN
Potential mappings (found by mapper):
map. 1: 1-NP-VBZ (1-INAN-VBZ)
map. 2: 1-NP-VBZ (1-ANIM-VBZ)
map. 3: 1-NP-VBZ-PP (1-ANIM-VBZ-PP)
map. 4: 1-NP-VBZ-PP (1-ANIM-VBZ-TO-ANIM)
(...)
Frame representation (for mapping 4):
[agent/an] (i/PRP)
5POS abbreviations: PRP=personal pronoun,
AUX=auxiliary verb, VB=main verb (non-inflected),
PREP=preposition, PRPA=personal pronoun in accusative,
RB=adverb.
6Frame abbreviations:
INAN=inanimate NP, ANIM=animate NP, VBZ=inflected
main verb, IS=is, VBG=gerund, PP=prepositional phrase,
TO=to (prep.), ON=on (prep.).
classification   occ. (%)     comment
correct          21 (21.4%)   at least one reasonable mapping is found
non-mappable     23 (23.5%)   clause consists of noises/hesitations only
ungrammatical    28 (28.6%)   e.g., incomplete phrase, no verb
preprocessing    13 (13.3%)   problem is caused by errors in POS tagger/segmenter/parser
mapper           13 (13.3%)   problem due to incompleteness of mapper
Table 3: Summary of classification results for mapper output
[pred] ([vb_fin] ([aux] (will/AUX)
                  [head] (talk/VB))
        [pp_obj] ([prep] (to/PREP)
                  [theme/an] (you/PRPA)))
[modif] (again/RB)
Since chunks like advp or conj are not part of the 
WordNet frames, we remove these from the parsed 
chunk sequence, before a mapping attempt is being 
made. 7 
In our example, WordNet yields 14 frames for 6
senses of the main verb talk. The mapper already
finds a "perfect match"8 for the first, i.e., the most
frequent, sense9 of the verb (mapping 4 can be judged
more accurate than mapping 3 since the preposition
also matches the input string).
This will also be the default sense to choose, unless
a word sense disambiguation module is available
that strongly favors a less frequent sense.
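The selection among candidate mappings described above can be sketched as follows. This is not the system's mapper: the scoring order (more chunks covered, then a matching preposition, then the lower sense number) is an assumption chosen so that the sketch reproduces the preference for mapping 4, and animacy is not checked. The frame list is the one given in the example; all function names are hypothetical.

```python
DROP = {"advp", "conj"}          # chunk types that WordNet frames do not mention
PREPS = {"TO": "to", "ON": "on"}
NP_SLOTS = {"ANIM", "INAN"}


def parse_frame(spec):
    """Turn '1-ANIM-VBZ-TO-ANIM' into (1, ['ANIM', 'VBZ', 'TO', 'ANIM'])."""
    sense, *slots = spec.split("-")
    return int(sense), slots


TALK_FRAMES = [parse_frame(s) for s in (
    "1-INAN-VBZ", "1-ANIM-VBZ", "1-INAN-IS-VBG-PP", "1-ANIM-VBZ-PP",
    "1-ANIM-VBZ-TO-ANIM", "2-ANIM-VBZ", "2-ANIM-VBZ-PP", "3-ANIM-VBZ",
    "3-ANIM-VBZ-INAN", "4-ANIM-VBZ", "5-ANIM-VBZ", "6-ANIM-VBZ",
    "6-INAN-VBZ-TO-ANIM", "6-ANIM-VBZ-ON-INAN")]


def match(slots, chunks):
    """Match frame slots against a prefix of the chunks.

    Returns (chunks_covered, preposition_matched), or None if the frame
    does not fit the clause.
    """
    i = k = 0
    prep_matched = False
    while k < len(slots):
        if i >= len(chunks):
            return None  # frame needs more arguments than the clause has
        label, words = chunks[i]
        slot = slots[k]
        if slot in NP_SLOTS and label == "np":
            k += 1
        elif slot == "VBZ" and label in ("vb", "vbneg"):
            k += 1
        elif slot == "PP" and label == "pp":
            k += 1
        elif (slot in PREPS and label == "pp" and words[0] == PREPS[slot]
              and k + 1 < len(slots) and slots[k + 1] in NP_SLOTS):
            k += 2  # specific preposition plus its NP are one pp chunk
            prep_matched = True
        else:
            return None
        i += 1
    return i, prep_matched


def best_mapping(frames, chunks):
    """Return (sense, slots, is_perfect_match) for the best frame, or None."""
    chunks = [c for c in chunks if c[0] not in DROP]
    candidates = []
    for sense, slots in frames:
        result = match(slots, chunks)
        if result is not None:
            covered, prep = result
            candidates.append((covered, prep, -sense, sense, slots))
    if not candidates:
        return None
    covered, _, _, sense, slots = max(candidates, key=lambda c: c[:3])
    return sense, slots, covered == len(chunks)
```

Applied to the chunks of the example clause (np "i", vb "will talk", pp "to you", advp "again"), the sketch selects the frame 1-ANIM-VBZ-TO-ANIM as a perfect match, which corresponds to mapping 4.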
Since WordNet 1.5 does not provide detailed 
semantic frame information but only general 
subcategorization with extensions such as "ani- 
mate/inanimate", we plan to extend this infor- 
mation by processing machine-readable dictionaries 
which provide a richer set of semantic role information
of verbal heads.10
It is interesting to see that even at this early stage 
of our project the results of this shallow analysis are 
quite encouraging. If we remove from the test set
those clauses which either should not or cannot be
mapped in the first place (because they either
contain no structure ("non-mappable") or
are ungrammatical), the remaining 47 clauses
already show a success rate of 44.7%. Improvements of
the system components before the mapping stage as 
well as to the mapper itself will further increase the 
mapping performance. 
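The two success rates follow directly from the counts in Table 3; the short calculation below only restates that arithmetic (the variable names are ours).

```python
# Clause counts per category, taken from Table 3.
counts = {
    "correct": 21,
    "non-mappable": 23,
    "ungrammatical": 28,
    "preprocessing": 13,
    "mapper": 13,
}

total = sum(counts.values())
# Clauses that remain after removing non-mappable and ungrammatical ones.
mappable = total - counts["non-mappable"] - counts["ungrammatical"]

rate_all = 100 * counts["correct"] / total
rate_mappable = 100 * counts["correct"] / mappable

print(f"{total} clauses: {rate_all:.1f}% correct")
print(f"{mappable} mappable clauses: {rate_mappable:.1f}% correct")
```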
7These chunks can easily be added to the mapper's output
again, as shown in the example.
8Partial matches, such as mappings 1 and 2 in this example,
are allowed but disfavored relative to perfect matches.
9In WordNet 1.5, the first sense is also supposed to be the
most frequent one.
10The "agent" and "theme" assignments are currently just
defaults for these types of subcat frames.
5 Future Work 
It is obvious from our evaluation that most core
components, specifically the mapper, need to be
improved and refined. As for the mapper, there are
issues of constituent coordination, split verbs, and
infinitival complements that need to be addressed and
properly handled. Also, the "linkage" between main
and relative clauses has to be performed such that
this information is maintained and not lost due to
the segmentation into short clauses.
Experiments with speech recognizer output instead
of transcripts will show to what extent we still get
reasonable frame representations when we are faced
with erroneous input in the first place. Specifically,
since the mapper relies on the identification of the 
"head verb", it will be crucial that at least those 
words are correctly recognized and tagged most of 
the time. 
To further enhance our representation, we could 
use speech act tags, generated by an automatic 
speech act classifier (Finke et al., 1998) and attach 
these to the short clauses. 11 
6 Summary 
We have presented a system which is able to build 
shallow semantic representations for spontaneous 
speech in unrestricted domains, without the neces- 
sity of extensive knowledge engineering. 
Initial experiments demonstrate that this ap- 
proach is feasible in principle. However, more work 
to improve the major components is needed to reach 
a more reliable and valid output. 
The potential of this approach for NLP applications
that use speech as their input is obvious:
semantic representations can enhance almost all tasks
that so far have either been restricted to narrow do- 
mains or were mainly using word-level representa- 
tions, such as text summarization, information re- 
trieval, or shallow machine translation. 
11 Sometimes, the speech acts will span more than one short 
clause but as long as the turn-boundaries are fixed for both 
our system and the speech act classifier, the re-combination 
of short clauses can be done straightforwardly. 
7 Acknowledgements
The author wants to thank Marsal Gavaldà, Mirella
Lapata, and the three anonymous reviewers for
valuable comments on this paper.
This work was funded in part by grants of the
Verbmobil project of the Federal Republic of Germany,
ATR - Interpreting Telecommunications Research
Laboratories of Japan, and the US Department
of Defense.

References 
Steven Abney. 1996. Partial parsing via finite-state 
cascades. In Workshop on Robust Parsing, 8th 
European Summer School in Logic, Language and 
Information, Prague, Czech Republic, pages 8-15. 
Eric Brill. 1994. Some advances in transformation-
based part of speech tagging. In Proceedings of
AAAI-94.
Ted Briscoe and John Carroll. 1997. Automatic 
extraction of subcategorization from corpora. In 
Proceedings of the 5th ANLP Conference, Wash- 
ington DC, pages 24-29. 
Michael Finke, Jürgen Fritsch, Petra Geutner, Klaus
Ries and Torsten Zeppenfeld. 1997. The Janus-
RTk SWITCHBOARD/CALLHOME 1997 Evaluation
System. In Proceedings of LVCSR Hub5-e Workshop,
May 13-15, Baltimore, Maryland.
Michael Finke, Maria Lapata, Alon Lavie, Lori 
Levin, Laura Mayfield Tomokiyo, Thomas Polzin, 
Klaus Ries, Alex Waibel and Klaus Zechner. 1998. 
CLARITY: Inferring Discourse Structure from 
Speech. In Proceedings of the AAAI 98 Spring 
Symposium: Applying Machine Learning to Discourse
Processing, Stanford, CA, pages 25-32.
J. J. Godfrey, E. C. Holliman, and J. McDaniel.
1992. SWITCHBOARD: telephone speech corpus
for research and development. In Proceedings of
the ICASSP-92, volume 1, pages 517-520.
Alon Lavie. 1996. GLR*: A Robust Grammar-
Focused Parser for Spontaneously Spoken Language.
Ph.D. thesis, Carnegie Mellon University,
Pittsburgh, PA.
Christopher D. Manning. 1993. Automatic acquisition
of a large subcategorization dictionary from
corpora. In Proceedings of the 31st Annual Meeting
of the ACL, pages 235-242.
George A. Miller, Richard Beckwith, Christiane 
Fellbaum, Derek Gross, and Katherine Miller. 
1993. Five papers on WordNet. Technical report, 
Princeton University, CSL, revised version, Au- 
gust. 
Piek Vossen, Pedro Diez-Orzas, and Wim Peters.
1997. The Multilingual Design of EuroWordNet.
In Proceedings of the ACL/EACL-97 workshop
Automatic Information Extraction and Building
of Lexical Semantic Resources for NLP Applications,
Madrid, July 12th, 1997.
Alex Waibel, Michael Finke, Donna Gates, Marsal
Gavaldà, Thomas Kemp, Alon Lavie, Lori Levin,
Martin Maier, Laura Mayfield, Arthur McNair,
Ivica Rogina, Kaori Shima, Tilo Sloboda, Monika
Woszczyna, Torsten Zeppenfeld, and Puming
Zhan. 1996. JANUS-II - advances in speech recognition.
In Proceedings of the ICASSP-96.
Wayne Ward. 1991. Understanding spontaneous 
speech: The PHOENIX system. In Proceedings 
of ICASSP-91, pages 365-367. 
Klaus Zechner. 1997. Building chunk level
representations for spontaneous speech in unrestricted
domains: The CHUNKY system and
its application to reranking Nbest lists of a
speech recognizer. M.S. Project Report, CMU,
Department of Philosophy. Available from
http://www.contrib.andrew.cmu.edu/~zechner/publications.html
