RESEARCH IN NATURAL LANGUAGE PROCESSING 
A. Joshi, M. Marcus, M. Steedman, B. Webber 
Department of Computer and Information Science 
University of Pennsylvania 
Philadelphia, PA 19104 
PROJECT GOALS 
The main objective is to develop robust methods for the 
understanding and generation of both written and spo- 
ken human language, including but not limited to En- 
glish. Penn is pursuing development of: (1) New mathe- 
matical and computational frameworks which are highly 
constrained, yet adequate to allow a simple, concise de- 
scription of complex linguistic phenomena. These new 
frameworks are tested by the explicit encoding within 
each framework of a wide range of phenomena across a 
diverse set of human languages. (2) Both statistical and 
symbolic learning methods which automatically extract 
and effectively utilize the implicit linguistic knowledge 
in the Penn Treebank and the corpora of the Linguistic 
Data Consortium. These techniques have been tested 
against the performance of the best current methods. 
RECENT RESULTS 
• In a lexicalized grammar such as the lexicalized tree- 
adjoining grammar (LTAG), each lexical item is as- 
sociated with one or more elementary trees (struc- 
tures), called supevtags. We have developed tech- 
niques to eliminate or substantially reduce the su- 
pertag assignment ambiguity by using local lexical 
dependencies and their distribution, prior to pars- 
ing. After this step only explicit indication of sub- 
stitutions and adjoinings must be indicated to com- 
plete parsing. Preliminary experiments on short 
fragments show a success rate of 88%, with experi- 
ments continuing on full sentences from WS3 mate- 
rial. 
• The Information-Based Intonation Synthesis (IBIS) 
spoken reply system has been extended by a richer 
semantics for the assignment of stress on the basis 
of contrast in the domain of discourse. Synthesis of 
spoken responses as speech waves bearing an intona- 
tion contour appropriate to the context of utterance 
has thereby been considerably improved. 
• A weakly supervised symbolic learning algorithm 
called Error Based Transformation Learning has 
been developed that matches or beats the perfor- 
mance of the best standard methods for a range of 
key language analysis tasks. This method has also 
been used for part of speech tagging for several lan- 
guages other than English with very good results. 
A new algorithm for word-sense determination per- 
forms as least as well as existing algorithms, while 
only using only a window of five words around the 
target word, as opposed to 100 words for these ex- 
isting methods. 
PLANS 
Apply part-of-speech disambiguation strategies to 
the disambiguation of lexical category assignm- 
ments words in a combinatory categorial parser. 
Port the IBIS spoken response generator to the 
larger domain involved in the task of critiquing of 
Medical Diagnosis by an expert system. 
Explore statistical morphology induction, lexi- 
cal disambiguation, and language modeling with 
stochastic dependency grammars. 
Test the XTAG system on a corpus and build a TAG 
parsed corpus to serve as the basis for statistical 
experiments with the TAG grammar and parser. 
Contribute to a model of limited processing for dis- 
course, using LDC corpora as the basis for an empir- 
ical analysis of bottom-up cues to discourse struc- 
ture, such as variation in the forms of referring 
expressions, and prosodic marking by topline and 
baseline variation. 
Develop part-of-speech taggers and morphological 
learners for a range of languages other than English. 
Develop the 'strategic' or discourse-planning com- 
ponent of the spoken reply system. 
476 
