American Sign Language Generation:  
Multimodal NLG with Multiple Linguistic Channels 
 
Matt Huenerfauth  
Computer and Information Science  
University of Pennsylvania 
Philadelphia, PA 19104 USA 
matt@huenerfauth.com 
 
 
 
Abstract 
Software to translate English text into 
American Sign Language (ASL) animation 
can improve information accessibility for 
the majority of deaf adults with limited 
English literacy.  ASL natural language 
generation (NLG) is a special form of mul-
timodal NLG that uses multiple linguistic 
output channels.  ASL NLG technology has 
applications for the generation of gesture 
animation and other communication signals 
that are not easily encoded as text strings.  
1 Introduction and Motivations 
American Sign Language (ASL) is a full natural 
language – with a linguistic structure distinct from 
English – used as the primary means of communi-
cation for approximately one half million deaf 
people in the United States (Neidle et al., 2000;
Liddell, 2003; Mitchell, 2004).  Without aural ex-
posure to English during childhood, a majority of 
deaf U.S. high school graduates (age 18) have only 
a fourth-grade (age 10) English reading level (Holt, 
1991).  Technology for the deaf rarely addresses 
this literacy issue; so, many deaf people find it dif-
ficult to read text on electronic devices.  Software 
for translating English text into animations of a 
computer-generated character performing ASL can 
make a variety of English text sources accessible to 
the deaf, including: TV closed captioning, teletype 
telephones, and computer user-interfaces  (Huener-
fauth, 2005).  Machine translation (MT) can also 
be used in educational software for deaf children to 
help them improve their English literacy skills.   
This paper describes the design of our English-
to-ASL MT system (Huenerfauth, 2004a, 2004b, 
2005), focusing on ASL generation. This overview 
illustrates important correspondences between the 
problem of ASL natural language generation 
(NLG) and related research in Multimodal NLG.   
1.1 ASL Linguistic Issues 
In ASL, several parts of the body convey meaning 
in parallel: hands (location, orientation, shape), eye 
gaze, mouth shape, facial expression, head-tilt, and 
shoulder-tilt.  Signers may also interleave lexical 
signing (LS) with classifier predicates (CP) during 
a performance.  During LS, a signer builds ASL 
sentences by syntactically combining ASL lexical 
items (arranging individual signs into sentences).  
The signer may also associate entities under dis-
cussion with locations in space around their body; 
these locations are used in pronominal reference 
(pointing to a location) or verb agreement (aiming 
the motion path of a verb sign to/from a location).   
During CPs, signers’ hands draw a 3D scene in 
the space in front of their torso.  One could imag-
ine invisible placeholders floating in front of a 
signer representing real-world objects in a scene.  
To represent each object, the signer places his/her 
hand in a special handshape (used specifically for 
objects of that semantic type: moving vehicles, 
seated animals, upright humans, etc.).  The hand is 
moved to show a 3D location, movement path, or 
surface contour of the object being described. For 
example, to convey the English sentence “the car 
parked next to the house,” signers would indicate a 
location in space to represent the house using a 
special handshape for ‘bulky objects.’  Next, they 
would use a ‘moving vehicle’ handshape to trace a 
3D path for the car which stops next to the house. 
1.2 Previous ASL MT Systems 
There have been some previous English-to-ASL 
MT projects – see survey in (Huenerfauth, 2003).  
Amid other limitations, none of these systems ad-
dress how to produce the 3D locations and motion 
paths needed for CPs.  A fluent, useful English-to-
ASL MT system cannot ignore CPs.  ASL sign-
frequency studies show that signers produce a CP 
from 1 to 17 times per minute, depending on genre 
(Morford and MacFarlane, 2003).  Further, it is precisely those English sentences whose ASL translations use CPs that a deaf user with low English literacy would most need an MT system to translate: these English sentences look structurally different from their ASL CP counterparts, which often makes them difficult for a deaf user to read.
2 ASL NLG: A Form of Multimodal NLG 
NLG researchers think of communication signals in a variety of ways: some as written text, others as speech audio (with prosody, timing, volume, and intonation), and those working in Multimodal NLG as text/speech with coordinated graphics (maps, charts, diagrams, etc.).  Some Multimodal
NLG focuses on “embodied conversational agents” 
(ECAs), computer-generated animated characters 
that communicate with users using speech, eye 
gaze, facial expression, body posture, and gestures 
(Cassell et al., 2000; Kopp et al., 2004).   
The output of any NLG system could be repre-
sented as a stream of values (or features) that 
change over time during a communication signal; 
some NLG systems specify more values than oth-
ers.  Because the English writing system does not 
record a speaker’s prosody, facial expression or 
gesture¹, a text-based NLG system specifies fewer
communication stream values in its output than 
does a speech-based or ECA system.  A text-based 
NLG system requires literate users, to whom it can 
transfer some of the processing burden; they must 
mentally reconstruct more of the language per-
formance than do users of speech or ECA systems. 
Since most writing systems are based on strings, 
text-based NLG systems can easily encode their 
output as a single stream, namely a sequence of words/characters.  To generate more complex signals, multimodal systems decompose their output into several sub-streams – we’ll refer to these as “channels.”  Dividing a communication signal into channels can make it easier to represent the various choices the generator must make; generally, a different processing component of the system will govern the output of each channel.  The trade-off is that these channels must be coordinated over time.

¹ Some punctuation marks loosely correspond to intonation or pauses, but most prosodic information is lost.  Facial expression and gesture are generally not conveyed in writing, except perhaps for the occasional use of “emoticons.”  ;-)
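To make this stream-of-values view concrete, here is a minimal sketch (in Python; all names are our own invention for exposition, not any system’s actual API) of a communication signal as a set of named channels holding time-stamped events.  A text-based system populates a single channel, while an ASL system must populate several in parallel:

```python
from collections import defaultdict

class CommunicationSignal:
    """A signal as named channels of (start_time, end_time, value) events.

    A text-based NLG system fills only one channel; a multimodal or
    ASL system fills several that must stay coordinated in time.
    """

    def __init__(self):
        self.channels = defaultdict(list)

    def emit(self, channel, start, end, value):
        self.channels[channel].append((start, end, value))

    def events_between(self, t0, t1):
        """All events (on any channel) overlapping the interval [t0, t1)."""
        return {ch: [e for e in evs if e[0] < t1 and e[1] > t0]
                for ch, evs in self.channels.items()}

# A text system uses a single channel...
text_signal = CommunicationSignal()
text_signal.emit("words", 0.0, 0.4, "the")
text_signal.emit("words", 0.4, 0.9, "car")

# ...while an ASL signal spreads over several coordinated channels.
asl_signal = CommunicationSignal()
asl_signal.emit("right_hand", 0.0, 1.2, "vehicle-classifier-move")
asl_signal.emit("eye_gaze",   0.0, 1.2, "follow-right-hand")
asl_signal.emit("head_tilt",  0.0, 1.2, "toward-path")
```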
Instead of thinking of channels as dividing a 
communication signal, we can think of them as 
groupings of individual values in the data stream 
that are related in some way.  The channels of a 
multimodal NLG system generally correspond to 
natural perceptual/conceptual groupings called 
“modalities.”  Coarsely, audio and visual parts of 
the output are thought of as separate modalities.  
When parts of the output appear on different por-
tions of the display, then they are also generally 
considered separate modalities.  For instance, a 
multimodal NLG system for automobile driving 
directions may have separate processing channels 
for text, maps, other graphics, and sound effects.  
An ECA system may have separate channels for 
eye gaze, facial expression, manual gestures, and 
speech audio of the animated character.   
When a language has no commonly-known writ-
ing system – as is the case for ASL – it is not
possible to build a text-based NLG system.  We 
must produce an animation of a character (like an 
ECA) performing ASL; so, we must specify how 
the hands, eye gaze, mouth shape, facial expres-
sion, head-tilt, and shoulder-tilt are coordinated 
over time.  With no conventional string-encoding 
of ASL (that would compress the signal into a sin-
gle stream), an ASL signal is spread over multiple 
channels of the output – a departure from most 
Multimodal NLG systems, which have a single 
linguistic channel/modality that is coordinated with 
other non-linguistic resources (Figure 1). 
Figure 1: Linguistic Channels in Multimodal Systems.  (A prototypical driving-direction system has a single linguistic channel – English text – coordinated with non-linguistic channels for driving maps, other graphics, and sound effects.  A prototypical ASL system has multiple linguistic channels: left hand, right hand, head-tilt, eye-gaze, and facial expression.)
Of course, we could invent a string-based nota-
tion for ASL so that we could use traditional text-
based NLG technology.  (Since ASL has no writ-
ing system, we would have to invent an artificial 
notation.)  Unfortunately, since the users of the 
system wouldn’t be trained in this new writing sys-
tem, it could not be used as output; we would still 
need to generate a multimodal animation output. 
An artificial writing system could only be used for 
internal representation and processing.  However,
flattening a naturally multichannel signal into a 
single-channel string (prior to generating a mul-
tichannel output) can introduce its own complica-
tions to the ASL system’s design.  For this reason, 
this project has been exploring ways to represent 
the hierarchical linguistic structure of information 
on multiple channels of ASL performance (and 
how these structures are coordinated or uncoordi-
nated across channels over time). 
Some multimodal systems have explored using 
linguistic structures to control (to some degree) the 
output of multiple channels.  Research on generat-
ing animations of a speaking ECA character that 
performs meaningful gestures (Kopp et al., 2004) 
has similarities to this ASL project.  First of all, the 
channels in the signal are basically the same; an 
animated human-like character is shown onscreen 
with information about eye, face, and arm move-
ments being generated.  However, an ASL system 
has no audio speech channel but potentially more 
fine-grained channels of detailed body movement. 
The less superficial similarity is that Kopp et al. (2004) have attempted to represent the semantic
meaning of some of the character’s gestures and to 
synchronize them with the speech output.  This 
means that, as in an ASL NLG system, several channels of the signal are governed by the linguistic mechanisms of a natural language.
Unlike ASL, the gesture system uses the speech 
audio channel to convey nearly all of the meaning 
to the user; the other channels are generally used to 
convey additional/redundant information.  Further, 
the internal structure of the gestures is not gener-
ally encoded in the system; they are typically 
atomic/lexical gesture events which are synchro-
nized to co-occur with portions of speech output.  
A final difference is that gestures which co-occur 
with English speech (although meaningful) can be 
somewhat vague and are certainly less systematic 
and conventional than ASL body movements.  So, 
while both systems may have multiple linguistic 
channels, the gesture system still has one primary 
linguistic channel (audio speech) and a few chan-
nels controlled in only a partially linguistic way. 
3 This English-to-ASL MT Design 
The linguistic and multimodal issues discussed 
above have had important consequences for the
design of our English-to-ASL MT system.  There 
are several unique features of this system caused 
by: (1) ASL having multiple linguistic channels 
that must be coordinated during generation, (2) 
ASL having both an LS and a CP form of signing, 
(3) CP signing visually conveying 3D spatial rela-
tionships in front of the signer’s torso, and (4) ASL 
lacking a conventional written form.  While ASL-
particular factors influenced this design, section 5 
will discuss how this design has implications for 
NLG of traditional written/spoken languages. 
3.1 Coordinating Linguistic Channels 
Section 2 mentioned that this project is developing 
multichannel (non-string) encodings of ASL ani-
mation; these encodings must coordinate multiple 
channels of the signal as they are generated by the 
linguistic structures and rules of ASL.  Kopp et al. 
(2004) have explored how to coordinate meaning-
ful gestures with speech signal during generation; 
however, their domain is somewhat simpler.  Their 
gestures are atomic events without internal hierar-
chical structure.  Our project is currently develop-
ing grammar-like coordination formalisms that 
allow complex linguistic signals on multiple chan-
nels to be conveniently represented.²

² Details of this work will be described in a future publication.
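Since the formalism itself is deferred to a future publication, the following sketch is purely illustrative of the kind of object such a formalism must describe – a hierarchical structure whose nodes mark whether their children are sequenced in time or performed in parallel on different channels (all names here are our own assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A hierarchical multichannel structure: internal nodes mark whether
    children follow one another in time (SEQ) or are performed
    simultaneously on different channels (PAR)."""
    label: str
    mode: str = "LEAF"          # "SEQ", "PAR", or "LEAF"
    channel: str = ""           # for leaves: which articulator
    children: list = field(default_factory=list)

# "The car parked next to the house" as a sequence whose final element
# is a classifier predicate spanning several channels in parallel:
cp = Node("classifier-predicate", "PAR", children=[
    Node("vehicle-handshape-path", channel="right_hand"),
    Node("gaze-at-path", channel="eye_gaze"),
    Node("tilt-toward-path", channel="head_tilt"),
])
sentence = Node("asl-sentence", "SEQ", children=[
    Node("HOUSE-sign", channel="right_hand"),
    cp,
])
```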
3.2 ASL Computational Linguistic Models 
This project uses representations of discourse, se-
mantics, syntax, and (sign) phonology tailored to 
ASL generation (Huenerfauth, 2004b).  In particu-
lar, since this MT system will generate animations 
of classifier predicates (CPs), the system consults a 
3D model of real-world scenes under discussion.  
Further, since multimodal NLG requires a form of 
scheduling (events on multiple channels are coor-
dinated over a performance timeline), all of the 
linguistic models consulted and modified during 
ASL generation are time-indexed according to a 
timeline of the ASL performance being produced.
Previous ASL phonological models were de-
signed to represent non-CP ASL, but CPs use a 
reduced set of handshapes, standard eye-gaze and 
head-tilt patterns, and more complex orientations 
and motion paths.  The phonological model devel-
oped for this system makes it easier to specify CPs. 
Because ASL signers can use the space in front 
of their body to visually convey information, it is 
possible during CPs to show the exact 3D layout of 
objects being discussed.  (The use of channels rep-
resenting the hands means that we can now indi-
cate 3D visual information – not possible with 
speech or text.)  To represent this 3D detailed form 
of meaning, this system has an unusual semantic 
model for generating CPs.  We populate the vol-
ume of space around the signer’s torso with invisi-
ble 3D objects representing entities discussed by 
CPs being generated (Huenerfauth, 2004b).  The 
semantic model is the set of placeholders around 
the signer (augmented with the CP handshape used 
for each).   Thus, the semantics of the “car parked 
next to the house” example (section 1.1) is that a 
‘bulky’ object occupies a particular 3D location 
and a ‘vehicle’ object moves toward it and stops. 
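As a rough illustration of this unusual semantic model (a sketch under our own naming assumptions, not the system’s actual data structures), the placeholder set for this example might look like the following:

```python
from dataclasses import dataclass

@dataclass
class Placeholder:
    """An invisible object in the signing space: the semantic model for
    CPs is the set of these, each augmented with its CP handshape."""
    entity: str
    handshape: str              # e.g. 'bulky-object', 'vehicle'
    path: list                  # sequence of (x, y, z) locations

# Semantics of "the car parked next to the house":
house = Placeholder("house", "bulky-object", [(0.3, 0.0, 0.4)])
car = Placeholder("car", "vehicle",
                  [(-0.4, 0.0, 0.6), (0.0, 0.0, 0.5), (0.2, 0.0, 0.4)])
semantic_model = {p.entity: p for p in (house, car)}
```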
Of course, the system will also need more tradi-
tional semantic representations of the information 
to be conveyed during generation, but this 3D 
model helps the system select the proper 3D mo-
tion paths for the signers’ hands when “drawing” 
the 3D scenes during CPs.  Kopp et al. (2004) study gestures that convey spatial information during an English speech performance, but unlike this system, they use a logical-predicate-
based semantics to represent information about 
objects referred to by gesture.  Because ASL CPs 
indicate 3D layout in a linguistically conventional 
and detailed way, we use an actual 3D model of 
the objects being discussed.  Such a 3D model may 
also be useful for ECA systems that wish to gener-
ate more detailed 3D spatial gesture animations.   
The discourse model in this ASL system records 
features not found in other NLG systems.  It tracks 
whether a 3D location has been assigned to each 
discourse entity, where that location is around the 
signer, and whether the latest location of the entity 
has been indicated by a CP.  The discourse model 
is not only relevant during CP performance; since 
ASL LS performance also assigns 3D locations to 
objects under discussion (for pronouns and verb agreement), this model is also used for LS.
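A minimal sketch of such a discourse model (interface names are ours, for exposition only) might track exactly these features per entity:

```python
class DiscourseModel:
    """Tracks, per entity: whether a 3D locus has been assigned, where it
    is around the signer, and whether its latest location was shown by a
    CP.  Used both for CP generation and for LS pronouns/verb agreement."""

    def __init__(self):
        self._loci = {}          # entity -> (x, y, z) location
        self._shown_by_cp = set()

    def assign_locus(self, entity, xyz):
        self._loci[entity] = xyz

    def locus(self, entity):
        return self._loci.get(entity)    # None if not yet assigned

    def mark_indicated_by_cp(self, entity):
        self._shown_by_cp.add(entity)

    def needs_cp_indication(self, entity):
        return entity in self._loci and entity not in self._shown_by_cp
```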
3.3 Generating 3D Classifier Predicates 
An essential step in producing an animation of an 
ASL CP is the selection of 3D motion paths for the 
computer-generated signer’s hands, eye gaze, and 
head tilt.  The motion paths of objects in the 3D 
model described above are used to select corre-
sponding motion paths for these parts of the 
signer’s body during CPs.  To build the 3D place-
holder model, this system uses preexisting scene-
visualization software to analyze an English text 
describing the motion of real-world objects and 
build a 3D model of how the objects mentioned in 
text are arranged and move (Huenerfauth, 2004b).  
This model is “overlaid” onto the volume in front 
of the ASL signer (Figure 2).  For each object in 
the scene, a corresponding invisible placeholder is 
positioned in front of the signer; the layout of 
placeholders mimics the layout of objects in the 3D 
scene.  In the “car parked next to the house” exam-
ple, a miniature invisible object representing a 
‘house’ is positioned in front of the signer’s torso, 
and another object (with a motion path terminating 
next to the ‘house’) is added to represent the ‘car.’   
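The “overlay” step can be pictured as a coordinate mapping from the scene-visualization model into the signing volume; the sketch below is an illustrative linear rescaling under our own assumptions, not the actual scene-visualization software’s output format:

```python
def overlay(scene_points, scene_bounds, signing_bounds):
    """Map 3D scene coordinates into the signing volume in front of the
    torso, preserving the relative layout of the scene (illustrative
    linear rescaling between two axis-aligned bounding boxes)."""
    (smin, smax), (tmin, tmax) = scene_bounds, signing_bounds
    def rescale(p):
        return tuple(
            tmin[i] + (p[i] - smin[i]) / (smax[i] - smin[i])
                      * (tmax[i] - tmin[i])
            for i in range(3))
    return [rescale(p) for p in scene_points]

# The car's motion path from the visualization software, mapped into
# the volume in front of the signer:
car_path = overlay([(0.0, 0.0, 0.0), (4.0, 0.0, 2.0)],
                   scene_bounds=((0, 0, 0), (5, 5, 5)),
                   signing_bounds=((-0.5, 0.0, 0.3), (0.5, 0.5, 0.7)))
```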
The locations and orientations of the placehold-
ers are later used by the system to select the loca-
tions and orientations for the signer’s hands while 
performing CPs about them.  So, the motion path 
calculated for the car will be the basis for the 3D 
motion path of the signer’s hand during the classi-
fier predicate describing the car’s motion. Given 
the information in the discourse/semantic models, 
the system generates the hand motions, head-tilt, 
and eye-gaze for a CP.  It stores a library contain-
ing templates representing a prototypical form of 
each CP the system can produce.

Figure 2: Converting English Text to 3D Placeholder.  (English text – “THE CAR PARKED NEXT TO THE HOUSE.” – is processed by visualization software into a 3D model, which is overlaid in front of the ASL signer and converted to 3D placeholder locations/paths.)

The templates
are planning operators (with logical pre-conditions, 
monitored termination conditions, and effects), 
allowing the system to “trigger” other elements of 
ASL signing performance that may be required 
during a CP.  A planning-based NLG approach, 
described in (Huenerfauth, 2004b), is used to select 
a template, fill in its missing parameters, and build 
a schedule of the animation events on multiple 
channels needed to produce a sequence of CPs.   
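As an illustration of what such a template might contain (an assumed rendering for exposition; the actual operators are described in Huenerfauth, 2004b), a planning operator for the parking-car CP could look like this:

```python
from dataclasses import dataclass

@dataclass
class CPTemplate:
    """A prototypical classifier predicate as a planning operator, with
    logical preconditions, monitored termination conditions, and effects
    (our illustrative rendering, not the system's actual encoding)."""
    name: str
    parameters: list
    preconditions: list          # must hold before the CP starts
    termination: list            # monitored while the CP runs
    effects: list                # hold after the CP completes

vehicle_park = CPTemplate(
    name="vehicle-moves-and-stops",
    parameters=["?vehicle", "?landmark"],
    preconditions=["locus-assigned(?landmark)",
                   "gaze-at(path-start(?vehicle))"],
    termination=["hand-at(path-end(?vehicle))"],
    effects=["locus-assigned(?vehicle)",
             "indicated-by-cp(?vehicle)"],
)
```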
3.4 A Multi-Path Architecture 
A multimodal NLG system may have several pres-
entation styles it could use to convey information 
to its user; these styles may take advantage of the 
various output channels to different degrees.  In 
ASL, there are multiple channels in the linguistic 
portion of the signal, and not surprisingly, the lan-
guage has multiple sub-systems of signing that 
take advantage of the visual modality in different 
ways.  ASL signers can select whether to convey 
information using lexical signing (LS) or classifier 
predicates (CPs) during an ASL performance (sec-
tion 1.1).  These two sub-systems use the space 
around the signer differently; during CPs, locations 
in space associated with objects under discussion 
must be laid out in a 3D manner corresponding to 
the topological layout of the real-world scene un-
der discussion.  Locations associated with objects 
during LS (used for pronouns and verb agreement) 
have no topological requirement.  The layout of the 
3D locations during LS may be arbitrary. 
The CP generation approach in section 3.3 is 
computationally expensive; so, we would like to use this processing pathway only when necessary.
English input sentences not producing classifier 
predicates would not need to be processed by the 
visualization software; in fact, most of these sen-
tences could be handled using the more traditional 
MT technologies of previous systems.  For this 
reason, our English-to-ASL MT system has multi-
ple processing pathways (Huenerfauth, 2004a).  
The pathway for handling English input sentences 
that produce CPs includes the scene visualization 
software, while other input sentences undergo less 
sophisticated processing using a traditional MT 
approach (that is easier to implement).  In this way, 
our CP generation component can actually be lay-
ered on top of a pre-existing English-to-ASL MT 
system to give it the ability to produce CPs. This 
multi-path design is equally applicable to the archi-
tecture of written-language MT systems.  The de-
sign allows an MT system to combine a resource-
intensive deep-processing MT method for difficult 
(or important) inputs and a resource-light broad-
coverage MT method for other inputs.   
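The routing logic of this multi-path design reduces to a simple dispatch; the sketch below assumes hypothetical needs_cp, cp_pipeline, and traditional_mt components, which merely stand in for the real system’s modules:

```python
def translate(english_sentence, needs_cp, cp_pipeline, traditional_mt):
    """Route a sentence to the resource-intensive CP pathway only when
    its ASL translation requires a classifier predicate; otherwise use
    a cheaper, broad-coverage traditional MT pathway."""
    if needs_cp(english_sentence):
        # Deep pathway: build a 3D scene model, then plan the CP animation.
        scene = cp_pipeline.visualize(english_sentence)
        return cp_pipeline.generate_cp(scene)
    # Light pathway: traditional MT, easier to implement and cheaper to run.
    return traditional_mt(english_sentence)
```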
3.5 Evaluation of Multichannel NLG 
The lack of an ASL writing system and the mul-
tichannel nature of ASL can make NLG or MT 
systems which produce ASL animation output dif-
ficult to evaluate using traditional automatic tech-
niques.  Many such approaches compare a string 
produced by a system to some human-produced 
‘gold-standard’ string.  While we could invent an 
artificial ASL writing system for the system to 
produce as output, it’s not clear that human ASL 
signers could accurately or consistently produce 
written forms of ASL sentences to serve as ‘gold 
standards’ for such an evaluation.  And of course, 
real users of the system would never be shown arti-
ficial “written ASL”; they would see full anima-
tions instead.  User-based studies (where ASL 
signers evaluate animation output directly) may be 
a more meaningful measure of an ASL system. 
We are planning such an evaluation of a proto-
type CP-generation module of the system during 
the summer/fall of 2005.  Members of the deaf 
community who are native ASL signers will view 
animations of classifier predicates produced by the 
system.  As a control, they will also be shown an-
imations of CPs produced using 3D motion capture 
technology to digitally record the performance of 
CPs by other native ASL signers.  Their evaluation 
of animations from both sources will be compared 
to measure the system’s performance.  The mul-
tichannel nature of the signal also makes other in-
teresting experiments possible.  To study the 
system’s ability to animate the signer’s hands only, 
motion-captured ASL could be used to animate the 
head/body of the animated character, and the NLG 
system can be used to control only the hands of the 
character.  Thus, channels of the NLG system can 
be isolated for evaluation – an experimental design 
only available to a multichannel NLG system. 
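Such a channel-isolation experiment amounts to splicing generated channels into an otherwise motion-captured performance; a minimal sketch (the channel names and the dict-of-event-lists representation are our assumptions):

```python
def isolate_channels(mocap_channels, generated_channels, test_channels):
    """Build an evaluation stimulus: motion-captured data drives every
    channel except those under test, which come from the NLG system.
    Each argument maps a channel name to a list of timed events."""
    merged = {ch: list(evs) for ch, evs in mocap_channels.items()
              if ch not in test_channels}
    for ch in test_channels:
        merged[ch] = list(generated_channels.get(ch, []))
    return merged

# e.g., evaluate only the hands; all other channels stay motion-captured:
# stimulus = isolate_channels(mocap, generated, {"right_hand", "left_hand"})
```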
4 Unique Design Features for ASL NLG 
The design portion of this English-to-ASL project 
is nearly complete, and the implementation of the 
system is ongoing.  Evaluations of the system will 
be available after the user-based study discussed 
above; however, the design itself has highlighted 
interesting issues about the requirements of NLG 
software for sign languages like ASL.   
The multichannel nature of ASL has led this 
project to study mechanisms for coordinating the 
values of the linguistic models used during genera-
tion (including the output animation specification 
itself).  The need to handle both the LS and CP 
subsystems of the language has motivated: a multi-
path MT architecture, a discourse model that stores 
data relevant to both subsystems, a model of the 
space around the signer capable of storing both LS 
and CP placeholders, and a phonological model 
whose values can be specified by either subsystem.   
Since this English-to-ASL MT system is the first 
to address ASL classifier predicates, designing an 
NLG process capable of producing the 3D loca-
tions and paths in a CP animation has been a major 
design focus for this project.  These issues have 
been addressed by the system’s use of a 3D model 
of placeholders produced by scene-visualization 
software and a planning-based NLG process oper-
ating on templates of prototypical CP performance. 
5 Applications Beyond Sign Language 
Sign language NLG requires 3D spatial representa-
tions and multichannel coordinated output, but it’s 
not unique in this requirement.  In fact, generation 
of a communication signal for any language may 
require these capabilities (even for spoken lan-
guages like English).  We have mentioned 
throughout this paper how gesture/speech ECA 
researchers may be interested in NLG technologies 
for ASL – especially if they wish to produce ges-
tures that are more linguistically conventional, in-
ternally complex, or 3D-topologically precise.   
Many other computational linguistic applica-
tions could benefit from an NLG design with mul-
tiple linguistic channels (and indirectly benefit 
from ASL NLG technology).  For instance, NLG 
systems producing speech output could encode 
prosody, timing, volume, intonation, or other vocal 
data as multiple linguistically-determined channels 
of the output (in addition to a channel for the string 
of words being generated).  And so, ASL NLG 
research not only has exciting accessibility benefits 
for deaf users, but it also serves as a research vehi-
cle for NLG technology to produce a variety of 
richer-than-text linguistic communication signals. 
Acknowledgments  
I would like to thank my advisors Mitch Marcus 
and Martha Palmer for their guidance, discussion, 
and revisions during the preparation of this work. 
References 
Cassell, J., Sullivan, J., Prevost, S., and Churchill, E. 
(Eds.). 2000. Embodied Conversational Agents. 
Cambridge, MA: MIT Press. 
Holt, J. 1991. Demographic, Stanford Achievement Test 
- 8th Edition for Deaf and Hard of Hearing Students: 
Reading Comprehension Subgroup Results. 
Huenerfauth, M. 2003. Survey and Critique of ASL 
Natural Language Generation and Machine Transla-
tion Systems. Technical Report MS-CIS-03-32, 
Computer and Information Science, University of 
Pennsylvania. 
Huenerfauth, M. 2004a. A Multi-Path Architecture for 
Machine Translation of English Text into American 
Sign Language Animation. In Proceedings of the 
Student Workshop of the Human Language Tech-
nologies conference / North American chapter of the 
Association for Computational Linguistics annual 
meeting: HLT/NAACL 2004, Boston, MA, USA. 
Huenerfauth, M. 2004b. Spatial and Planning Models of 
ASL Classifier Predicates for Machine Translation. 
10th International Conference on Theoretical and 
Methodological Issues in Machine Translation: TMI 
2004, Baltimore, MD. 
Huenerfauth, M. 2005. American Sign Language Spatial 
Representations for an Accessible User-Interface.  In 
3rd International Conference on Universal Access in 
Human-Computer Interaction.  Las Vegas, NV, USA. 
Kopp, S., Tepper, P., and Cassell, J. 2004. Towards 
Integrated Microplanning of Language and Iconic 
Gesture for Multimodal Output. Int’l Conference on 
Multimodal Interfaces, State College, PA, USA. 
Liddell, S. 2003. Grammar, Gesture, and Meaning in 
American Sign Language. UK: Cambridge U. Press. 
Mitchell, R. 2004. How many deaf people are there in the United States? Gallaudet Research Institute, Grad
School & Prof. Progs. Gallaudet U.  June 28, 2004. 
http://gri.gallaudet.edu/Demographics/deaf-US.php 
Morford, J., and MacFarlane, J. 2003. Frequency Char-
acteristics of ASL.  Sign Language Studies, 3:2. 
Neidle, C., Kegl, J., MacLaughlin, D., Bahan, B., and 
Lee R.G. 2000. The Syntax of American Sign Lan-
guage: Functional Categories and Hierarchical Struc-
ture. Cambridge, MA: The MIT Press. 