HANDLING ILL-FORMED INPUT: SESSION INTRODUCTION 
Ralph M. Weischedel 
Department of Computer and Information Sciences 
University of Delaware 
Newark, Delaware 19711 
1. Introduction
Suppose we call "normative" any sys- 
tem based on a set of constraints (wheth- 
er pragmatic, semantic, or syntactic). 
Input that violates the constraints of a 
system could be termed "ill-formed". Son- 
dheimer and Weischedel (1980) identify 
two general classes of input appearing 
ill-formed to a normative system. Input 
will be called 
absolutely ill-formed, if native 
speakers generally agree that it 
violates one or more linguistic con- 
straints, or 
relatively ill-formed, if it violates
some constraint of the computational
system, though native speakers
perceive nothing odd about it. 
Examples of absolute ill-formedness 
include misspellings, mistypings, 
mispunctuations, tense and number errors, 
word order problems, run-on sentences, 
sentence fragments, extraneous forms, and 
meaningless sentences. Examples of rela- 
tively ill-formed input include unknown 
words and requests that are beyond the 
limits of either the computer system or 
the natural language interface. 
In natural language access (e.g. 
English access) to information systems, 
the magnitude of the problem of absolute 
ill-formedness can be seen in several 
case studies. If one includes telegraphic 
and elliptical constructions in the class 
of absolute ill-formedness, then case 
studies reported in Thompson (1980) and
Eastman and McLean (1981) indicate that 
as much as 25% of queries to question- 
answering systems are absolutely ill- 
formed. On the other hand, no matter how 
large the dictionary, grammar, and under- 
lying system, there will always be unk- 
nown words and phrases (e.g. proper 
names) and impossible requests (due to 
user misconceptions of the capabilities 
of the underlying system). 
2. Brief overview of the papers of this 
session 
This session consists of papers by 
Jensen and Heidorn; Marsh; and Granger et 
al. The paper by Jensen and Heidorn 
presents a particular heuristic for 
dealing with unparsable input. Since 
they have separate explicit heuristics 
for specific ungrammatical forms, a sig- 
nificant proportion of unparsable input 
in their system will be relatively ill- 
formed. One of the goals of the EPISTLE 
project, of which this is a part, is cri- 
tiquing business letters. 
Marsh's paper describes a technique 
for filling in material omitted from 
fragmentary inputs. Both syntactic in- 
formation and domain-specific constraints 
on semantic classes are used. This is 
part of an ongoing effort in extracting a 
data base from free-text medical records, 
such as narrative discharge summaries. 
The paper by Granger et al. reports
on NOMAD, a system for taking cryptic, 
errorful naval ship-to-shore messages and 
generating well-formed versions. The 
paper describes the methods used for
processing unknown words, fragments, missing
punctuation, and tense errors. 
*This material is based upon work partially sup- 
ported by the National Science Foundation under 
Grant No. IST-8009673. 
One way to classify ill-formedness
work is by the choices made regarding 
several issues. Each of the following 
sections will present one issue. 
3. Application area and responding to 
ill-formedness 
Ill-formedness processing has been 
examined in many application areas,
including data base access (Hendrix, et
al., 1978), building a data base (Marsh,
1983), grammar checking (Jensen and
Heidorn, 1983), generating well-formed
messages from ill-formed and incomplete
ones (Granger, et al., 1983), and
intelligent computer-assisted language
instruction (Weischedel, et al., 1978).
Proper response to an ill-formed 
input depends on the application environ- 
ment and the user. For example, suppose 
a presupposition of an input is incorrect 
according to the computer model. In a 
language instruction environment, the 
system should correct the erroneous in- 
put, informing the student of the cause 
of the error; this makes sense since stu- 
dents are error prone in learning a 
language. In data base access, the sys- 
tem might inform the user of the false 
presupposition and suggest alternative 
queries (Kaplan, 1978), since the user 
does not usually benefit from the empty 
set as a response. On the other hand, in 
the application Granger is investigating, 
the system should check an assumption it 
has made regarding the incomplete text 
and try an alternative, since the incon- 
sistency may stem from an assumption the 
system made in completing fragmentary 
input. 
4. The role of constraints 
Perhaps ill-formedness should be 
handled by simply not encoding some con- 
straints in the normative system. Waltz 
(1978) and Schank, et al. (1980) have 
taken this approach for a large class of 
syntactic constraints. In certain appli- 
cations, one can get by with such an 
approach due to the great amount of 
redundancy in natural language. However,
it clearly will not work in general,
since constraints generally carry meaning
and trim the search space. All three
papers of this session capitalize on
constraints (or expectations), and propose
mechanisms for processing ill-formedness. 
Given a commitment to employ con- 
straints rather than ignoring them, there 
are still the following design issues:
a) whether two separate systems,
perhaps running in parallel, should
be built for well-formedness and for
ill-formedness,
b) whether ill-formedness processing
can be stated as explicit
modifications of well-formedness
processing, and
c) whether a metric can be used
instead of employing any special
procedures for ill-formedness.
Weischedel and Sondheimer (1981) argue 
for an approach that explicitly relates 
ill-formedness processing to the rules of 
the normative system via meta-rules 
(rules that operate on rules). A more 
complete discussion of the alternatives 
appears there. 
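
The idea of a meta-rule can be made concrete with a minimal sketch. It is an illustration only, not the formalism of Weischedel and Sondheimer (1981): the names `Rule`, `CONSTRAINTS`, `RELAXABLE`, `relax`, and `apply_with_recovery`, and the toy feature dictionaries, are all assumptions introduced here. A normative rule carries named constraints; a meta-rule maps that rule to a relaxed copy, and the violated constraint is kept as a diagnosis.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Optional, Tuple

Features = Dict[str, str]


@dataclass(frozen=True)
class Rule:
    """A normative rule: a name plus the names of constraints it enforces."""
    name: str
    constraints: Tuple[str, ...]


# Hypothetical constraint tests over a (subject, verb) pair of feature sets.
CONSTRAINTS: Dict[str, Callable[[Features, Features], bool]] = {
    "subject_is_noun": lambda subj, verb: subj["cat"] == "noun",
    "number_agreement": lambda subj, verb: subj["number"] == verb["number"],
}

# Assumption: only these constraints may be relaxed during recovery.
RELAXABLE = {"number_agreement"}


def violated(rule: Rule, subj: Features, verb: Features) -> List[str]:
    """Return the names of the rule's constraints that the input violates."""
    return [c for c in rule.constraints if not CONSTRAINTS[c](subj, verb)]


def relax(rule: Rule, constraint: str) -> Rule:
    """Meta-rule: map a rule to a copy of itself without one constraint."""
    kept = tuple(c for c in rule.constraints if c != constraint)
    return replace(rule, name=f"{rule.name}-relaxed:{constraint}", constraints=kept)


def apply_with_recovery(
    rule: Rule, subj: Features, verb: Features
) -> Tuple[Optional[str], List[str]]:
    """Try the normative rule first; relax it only if every failure is relaxable.

    Returns the name of the rule that accepted the input (None if rejected)
    and the list of violated constraints, which serves as the diagnosis.
    """
    failures = violated(rule, subj, verb)
    if not failures:
        return rule.name, []
    if all(f in RELAXABLE for f in failures):
        relaxed = rule
        for f in failures:
            relaxed = relax(relaxed, f)
        return relaxed.name, failures
    return None, failures


if __name__ == "__main__":
    s_rule = Rule("S", ("subject_is_noun", "number_agreement"))
    subject = {"cat": "noun", "number": "plural"}
    verb = {"cat": "verb", "number": "singular"}
    print(apply_with_recovery(s_rule, subject, verb))
```

Because the relaxed rule is derived from the normative one, the grammar keeps a single statement of each constraint, and the record of what was relaxed remains available for choosing a response.
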
5. Categorization of ellipsis,
conjunction, and other gray cases 
It is debatable whether certain 
phenomena should be classified as well- 
formed or not. Ellipsis (fragmentary 
input which in context conveys a complete 
thought) is an example. In such cases, 
the definition of "ill-formedness" in 
terms of a normative system, as well as 
the distinction between absolute and 
relative ill-formedness, sheds light on 
the issue. In Marsh's system, syntactic 
forms for ellipses are explicitly coded 
in the normative system. Jensen and 
Heidorn do not include rules for
fragments in the grammar, but view them as
relative ill-formedness to be processed 
by a recovery strategy. Granger, et al. 
also view ellipsis as relatively ill- 
formed. 
Conjunction formation is another 
interesting case. Though use of conjunc- 
tion is clearly grammatical, a number of 
individuals (Sager, 1973; Woods, 1973; 
Kwasny and Sondheimer, 1981) have argued 
that it should not be included in the 
normative grammar, but rather should be 
processed via a recovery strategy. 
Therefore, they have argued for treating 
it as relatively ill-formed. 
I suspect that categorizing almost 
any particular constraint as normative 
could be the basis for argument. The 
criteria for deciding whether a con- 
straint should be included in the norma- 
tive system include at least the fol- 
lowing: 
a) whether a native speaker would 
edit inputs that violate it, 
b) whether violating the constraint 
can yield useful inferences, 
c) whether examples exist where the 
constraint carries meaning, 
d) whether the constraint, if
classified as normative, trims the 
search space, and 
e) whether a processing strategy 
for the constraint can be stated 
more easily as a modification of 
normative processing, as in the case 
of conjunction. 
6. Knowledge sources
In processing ill-formedness, there 
is usually more than one alternative for 
diagnosing the problem and for recover- 
ing. Oftentimes there is more than one 
alternative for what was intended. What 
kinds of knowledge are brought to bear on 
the problem? Jensen and Heidorn use syn- 
tactic information and an ordering heu- 
ristic in their parser. Marsh uses syn- 
tactic information and semantics (pri- 
marily selection restrictions). Granger 
et al. also employ syntactic and semantic 
constraints, but additionally employ 
"scripts" of stereotypical events in the 
environment of naval ship-to-shore mes- 
sages. 
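
To illustrate how a semantic knowledge source can choose among alternatives, here is a minimal sketch of selection restrictions used to fill in an omitted subject. It is not the procedure of any of the three papers: the tables `SEMANTIC_CLASS` and `SUBJECT_RESTRICTION`, the function `fill_omitted_subject`, and the toy medical vocabulary are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical domain lexicon: each known word belongs to one semantic class.
SEMANTIC_CLASS: Dict[str, str] = {
    "patient": "HUMAN",
    "doctor": "HUMAN",
    "x-ray": "TEST",
    "fever": "SYMPTOM",
}

# Hypothetical selection restrictions: classes each verb allows as its subject.
SUBJECT_RESTRICTION: Dict[str, Set[str]] = {
    "denies": {"HUMAN"},
    "shows": {"TEST"},
}


def fill_omitted_subject(verb: str, candidates: List[str]) -> List[str]:
    """Keep the candidate subjects whose class satisfies the verb's restriction.

    Candidates would come from context (e.g. earlier sentences); a verb with
    no recorded restriction yields no candidates rather than a guess.
    """
    allowed = SUBJECT_RESTRICTION.get(verb, set())
    return [c for c in candidates if SEMANTIC_CLASS.get(c) in allowed]


if __name__ == "__main__":
    # Fragment "denies fever": the subject has been omitted.
    print(fill_omitted_subject("denies", ["x-ray", "patient"]))
```

Syntax identifies that a subject is missing; the semantic classes then prune the candidates. Further knowledge sources, such as scripts, could be added as further filters or rankings over the same candidate list.
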
Using more classes of information to 
infer what was intended is an open
problem; the kinds of semantic and pragmatic
knowledge that could be helpful have 
barely been tapped. 
7. Control
The control mechanisms of the normative
system (e.g. bottom-up versus top-down
and the point at which semantic
constraints are used) are not of concern
to us here. Rather, what is of interest is
a) the point at which ill-formedness
strategies are employed,
b) the mechanism for identifying
what the problem is,
c) the nature of response, if any,
once a specific hypothesis of the
problem is made, and
d) the search strategy for selecting
an intended interpretation.
Many alternatives exist for these
decisions; an overview of them appears in
Sondheimer and Weischedel (1980).
8. Future directions
As evident from this session, the
processing of ill-formed input has become
a very active research topic that is of
critical importance in a wide variety of
applications. Yet, there is much to be
done. There are many kinds of
ill-formedness for which better heuristics
are needed.
Another need is empirical studies.
Controlling the processing of ill-formed
input is a substantial problem no matter
what approach one takes, since processing
ill-formedness requires relaxing the very
constraints that trim the search space
for possible interpretations. Because
control is such an important issue,
thorough, rigorous empirical studies are
much needed to determine the
effectiveness and costs of competing
strategies.
Publishing additional collections of
ill-formed input is critical. The
patterns of behavior evident in such
collections not only suggest heuristics
for ill-formedness processing but also
provide a basis for benchmarks upon which
to base empirical comparisons.
One other area needing much research
is models of particular users, their
plans, and goals. This is important to
infer the intended meaning of an
individual, since many explanations exist.
For instance, when no interpretation can
be found, correcting spelling, inferring
the meaning of an unknown word, or
relaxing a syntactic or semantic constraint
are all possibilities. Granger, et al.
(1983) and Allen, et al. (1983) both
report progress on using pragmatic
information to deal with fragmentary input.
The obvious benefit of ill-formedness
research is more natural,
easy-to-use systems. An additional
benefit is that study of ill-formedness
should lead to better understanding of
how normative systems should be designed.


References 

Allen, James F., Alan M. Frisch, Diane J. 
Litman, "ARGOT: The Rochester Dialogue 
System", Proceedings of the National 
Conference on Artificial Intelligence,
(1982), 66-70.

Eastman, C. M. and D. S. McLean, "On the 
Need for Parsing Ill-formed Input," 
American Journal of Computational 
Linguistics, 1981, 257.

Granger, Richard H., Chris J. Staros, 
Gregory B. Taylor, and Riku Yoshii, 
"Scruffy Text Understanding: Design and 
Implementation of the NOMAD System", ANLP, 1983. 

Hendrix, G. G., E. D. Sacerdoti, D. Sagalowicz and J. Slocum, "Developing a 
Natural Language Interface to Complex 
Data", ACM Transactions on Database 
Systems, 3, 2, (1978), 105-147.

Jensen, Karen and George E. Heidorn, "The
'Fitted' Parse: 100% Parsing Capability 
in a Syntactic Grammar of English", ANLP, 1983. 

Kaplan, S. J., "Indirect Responses to
Loaded Questions," Theoretical Issues in
Natural Language Processing-2, University
of Illinois at Urbana-Champaign, July, 1978.

Kwasny, S. C., and N. K. Sondheimer, 
"Relaxation Techniques for Parsing Ill-Formed Input", American Journal of 
Computational Linguistics, 1981,
99-108.

Marsh, Elaine, "Utilizing Domain-Specific 
Information for Processing Compact Text", 
ANLP, 1983. 

Sager, Naomi, "The String Parser for 
Scientific Literature." In R. Rustin, 
Ed., Natural Language Processing. New 
York: Algorithmics Press, 1973. 

Schank, Roger C., Michael Lebowitz, and 
Lawrence Birnbaum, "An Integrated
Understander", American Journal of 
Computational Linguistics, 1980, 
13-30. 

Sondheimer, N. K. and R. M. Weischedel, 
"A Rule-Based Approach to Ill-Formed 
Input," Proceedings of the Eighth 
International Conference on Computational 
Linguistics, Tokyo, October, 1980.

Thompson, B. H., "Linguistic Analysis of 
Natural Language Communication with Computers", Proceedings of the Eighth 
International Conference on Computational 
Linguistics, Tokyo, October, 1980, 
190-201. 

Waltz, D. L., "An English Language Ques- 
tion Answering System for a Large Rela- 
tional Database", Communications of the 
ACM, 21, 7, (1978), 526-539. 

Weischedel, R. M. and N. K. Sondheimer, "A
Framework for Processing Ill-Formed 
Input", Dept. of Computer & Information 
Sciences, University of Delaware, Newark, 
DE, 1981. 

Weischedel, R. M., W. M. Voge, and M. 
James, "An Artificial Intelligence 
Approach to Language Instruction", 
Artificial Intelligence, 10, (1978),
225-240. 

Woods, W. A., "An Experimental Parsing 
System for Transition Network Grammars." 
In R. Rustin, Ed., Natural Language
Processing. New York: Algorithmics
Press, 1973.
