PANEL 
Language Engineering : The Real Bottle Neck 
of Natural Language Processing 
Panel Organizer, Makoto Nagao 
Department of Electrical Engineering 
Kyoto University, Sakyo, Kyoto, Japan 
The bottleneck in building a practical natural language
processing system lies not in the problems that are
often discussed in research papers, but in handling much
messier expressions, which are exceptional for
theoreticians but which we encounter frequently. This panel
will focus on a problem that has rarely been written about
but has been argued informally among researchers who
have tried at least once to build a practical natural
language processing system.
Theory is important and valuable for explanation and
understanding, but it is essentially a first-order
approximation of its target object. As for language,
current theories cover just the basic part of language
structure. Real language usage is quite different from
the basic language structure and its supposed mechanism
of interpretation. A natural language processing system
must cover real language usage as much as possible. The
system model must be designed so that it is clearly
understandable with the support of a powerful linguistic
theory, and can still accept the varieties of exceptional
linguistic phenomena that the theory has difficulty
treating. How we can design such a system is a major
problem in natural language processing, especially for
machine translation between languages of different
linguistic families. We have to be concerned with both
the linguistic and the non-linguistic world. While we
have to study these difficult problems, we must not forget
the realizability of a useful system from the standpoint
of engineering.
I received valuable comments from Dr. Karen Jensen,
who cannot participate in our panel and kindly offered
to let me use her comments freely in it. I cite her
comments in the following.
Why Computational Grammarians Can Be 
Skeptical About Existing Linguistic Theories 
Karen Jensen
IBM TJ Watson Research Center
Yorktown Heights, NY 10598, U.S.A.
1. We need to deal with huge amounts of data (number of
sentences, paragraphs, etc.). Existing linguistic
theories (LTs) play with small amounts of data.
2. The data involve many (and messy) details. LTs are
prematurely fond of simplicity. For example: punctuation
is very important for processing real text, but
LTs have nothing to say about it. (This is actually
strange, since punctuation represents -- to some
extent -- intonational contours, and these are
certainly linguistically significant.)
3. There is no accepted criterion for when to abandon an
LT; one can always modify theory to fit counterexamples.
We have fairly clear criteria: if a computational
system cannot do its job in real time, then it
fails.
4. We need to use complex attribute-value structures,
which cannot be manipulated on paper or on a blackboard.
"Trees" are only superficially involved.
This means we are absolutely committed to computation.
LTs have various degrees of commitment.
5. We are not interested in using the most constrained/
restricted formalism. LTs generally are, because of
supposed claims about language processing mechanisms.
6. We are interested in uniqueness as much as in
generality. LTs usually are not.
7. We are more interested in coverage of the grammar
than in completeness of the grammar. LTs generally
pursue completeness.
8. We aim for "all," but not "only" the grammatical
constructions of a natural language. Defining
ungrammatical structures is, by and large, a futile
task (Alexis Manaster-Ramer, Wlodzimierz Zadrozny).
9. Existing LTs give at best a high-level specification
of the structure of natural language. Writing a
computational grammar is like writing a real program
given very abstract specs (Nelson Correa).
10. We are not skeptical of theory, just of existing
theories.
Existing linguistic theories are of limited usefulness to
broad-coverage, real-world computational grammars, perhaps
largely because existing theorists focus on limited notions of
"grammaticality," rather than on the goal of dealing, in some
fashion, with any piece of input text. Therefore, existing
theories play the game of ruling out many strings of a language,
rather than the game of trying to assign plausible structures
to all strings. We suggest that the proper goal of a working
computational grammar is not to accept or reject strings, but to
assign the most reasonable structure to every input string, and
to comment on it, when necessary. (This goal does not seem
to be psychologically implausible for human beings, either.)
For years it has seemed theoretically sound to assume
that the proper business of a grammar is to describe all of the
grammatical structures of its language, and only those
structures that are grammatical:
    The grammar of L will thus be a device that
    generates all of the grammatical sequences of L and
    none of the ungrammatical ones. (Chomsky 1957,
    p. 13)
At first blush, it seems unnecessary to conjure up any
justification for this claim. Almost by definition, the proper
business of a grammar should be grammaticality. However, it
has been notoriously difficult to draw a line between
"grammatical" sequences and "ungrammatical" sequences, for any
natural human language. It may even be provably impossible
to define precisely the notion of grammaticality for any
language. Natural language deals with vague predicates, and
might itself be called a vague predicator.
This being true, it still seems worthwhile to aim at parsing
ALL of the grammatical strings of a language, but parsing
ONLY the grammatical strings becomes a dubious enterprise
at best. Arguments for doing so reduce either to dogma, or to
some general notion of propriety. Arguments against, however,
are easy to come by. Leaving theoretical considerations aside
for the moment, consider these pragmatic ones:
(a) The diachronic argument. The creativity of human
use of language is great, and language systems are always
changing. A construction that was once unacceptable becomes
acceptable over time, and vice versa. Even if a grammar could
describe all and only the grammatical sequences today, the
same may not be true tomorrow. So there is, at best, only an
academic interest in only-grammatical structures.
(b) The practical argument. In the area of applied
computational linguistics, ill-formed input is a part of daily life,
and a working grammar has to handle it. By "handle it" we
mean not grind to a halt, but figure out some kind of
appropriate analysis and then comment, if possible, on whatever is
difficult or unusual. If real-life natural language processing
is going to exist, there must be some way to extract meaning
even from strings that violate customary syntactic rules, that
are excessively long and complex, and that are not sentences
at all.
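The idea of handling ill-formed input by analysing it and commenting
on it, rather than rejecting it, can be illustrated with a minimal
sketch. This is not PEG or CRITIQUE code: the `Analysis` class, the
`analyze` function, and the three surface checks are hypothetical
names and rules chosen only for illustration, and a flat token list
stands in for a real syntactic structure.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Analysis:
    """A structure for the input plus optional comments (hypothetical)."""
    tokens: list[str]
    comments: list[str] = field(default_factory=list)


def analyze(text: str) -> Analysis:
    """Always return an analysis; problems become comments, never rejections."""
    # Split into words and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    result = Analysis(tokens)

    # Assumed check 1: a comma run together with the following word.
    if re.search(r",\S", text):
        result.comments.append("missing space after comma")

    # Assumed check 2: modal "will" directly followed by an -ing form.
    for prev, cur in zip(tokens, tokens[1:]):
        if prev.lower() == "will" and cur.lower().endswith("ing"):
            result.comments.append(
                f"'will {cur}' may need 'be' or a base verb form"
            )

    # Assumed check 3: text opening with a subordinator may be a fragment.
    if tokens and tokens[0].lower() in {"because", "although"}:
        result.comments.append("possible sentence fragment")

    return result
```

The point of the design is that `analyze` has no failure path: every
input yields some structure, and deviations only add comments to it.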
At IBM Research, we are developing a broad-coverage
parsing grammar for English, called the PLNLP English Grammar,
or PEG. Its initial syntactic component works only with
limited information - lexical features for parts of speech, for
morphological structure, and for some valency classes. This
component tries to assign some reasonable structure to any
input string of English.
Even in its current beginning state, PEG has proved to
be of considerable usefulness for a rather wide variety of
real-world NLP tasks. Its main use so far has been as the parsing
component of CRITIQUE, a large-scale natural language text
processing system that identifies grammar and style errors in
English text (Heidorn et al. 1982, Richardson and
Braden-Harder 1988). A prototype CRITIQUE system is now
functioning in three major application areas: business offices, a
publishing center and universities.
Real-world natural language processing must deal with
huge amounts of data, which involve many, and messy, details.
For example, punctuation is very important in processing real
text, but current linguistic theories have nothing substantial
to say about punctuation. Nor have they anything substantial
to say about analysis structures for ellipsis, or for strings
that deviate in various degrees from the canonical order of the
language in which they occur. Here is the kind of natural
language input that CRITIQUE has to deal with. (All of the text
excerpts below are written EXACTLY as they were produced.)
First, a memo that was sent via electronic mail to multiple
users in the office environment:
(1) Over the course of the next couple of days the
accounting department will conducting inventory of
labs and offices here at XXXXX. I they are currently
working on the first floor, and working there way up.
If you are not in your office and do not plan to be
there within the next few days,please secure all
confidential mail and items you may have of confidential
nature. Because if you are not there accounting is
going to go in and inventory your equipment.
The author of text (1) is a native speaker of American
English, who has a college education and is employed in a
position of some responsibility in a large business firm. Note
the following problems:
(a) "will conducting" should be "will conduct";
(b) "conducting inventory" should be "conducting an
inventory";
(c) "I they" should be just "They";
(d) "working there way up" should be "working their
way up";
(e) "days,please" lacks a space between the comma
and "please";
(f) "of confidential nature" would be better written
as "of a confidential nature";
(g) The last text segment is a fragment, not a
complete clause, although it is presented as if it were
a sentence.
No theoretically pure grammar would ever be able to
analyze text like this. It may be objected that "grammar" defines
the competence that makes it possible for us to identify
mistakes (a - g), and that any working system is an embodiment of
a kind of performance, not competence. Very well; note then
that the role of "grammar" becomes that of a COMMENTARY
on the analysis structure, NOT the definition of the structure
itself. This is exactly the point. It may be that we need a new
definition of the term "grammar."
Within the educational environment, the challenge for a
computational grammar is even stronger. Following are two
excerpts from essays by non-native English speakers. Text
(2) is an extreme example of the run-on style of writing; the
interesting "grammatical" question is what cues might be used
to divide this text into separate sentences:
(2) After the analysis of three graphs we can
make conclusion. From 1940 to 1980 the farm
population and farms decrease but the average farm size
increase, this tendency shows American don't have
strong intensie to work on the farms, as a result it is
impossible to increase the farms but when The
people who would like to work on farms expand their
farm size by themselves or the aid of government;
maybe some other agents want to invest capital in
the "farming industry".
Text (3) shows interesting problems with the definite
article (mass vs. count NPs) and with auxiliaries in VPs:
(3) So we know, now we can use the fewer
people to get the more food. Is the decreasing farmer
we deduce on the graph? Is the farms going to
decreasing in future? Does the average of farm size
will develope? No. No. No.
The problem of non-"grammaticality" is pervasive in real
language use. The question
(4) Who did you tell me that won?
supposedly poses an extraction problem - in terms of
Government Binding Theory, it violates the Empty Category Principle.
Yet it can be heard from the mouths of people who would
otherwise qualify as speakers of Standard English. The sentence
(5) He bought for ten shillings a ring.
supposedly violates an ordering constraint in English because
the prepositional phrase "for ten shillings" precedes the direct
object "a ring." However, as the direct object NP becomes
heavier and heavier, the sentence sounds better and better:
(5') He bought for ten shillings a ring that
delighted the woman who had previously been
proposed to by millionaires.
To move "for ten shillings" to a position following the direct
object in (5') would be extremely awkward. In this case, it is
better to interpret the "grammatical" ordering rule as a stylistic
comment. The construction
(6) Himself's father came.
violates theoretical restrictions on anaphora, or Binding; but
it is fine if read with an Irish flavor. And the alternative of
having a completely separate grammar for Irish English is not
appealing. The sentence
(7) She be happy.
is censured because the main verb is not tensed; but (7) is valid
Non-standard Black English. And so on. Many theoretically
proscribed sequences exist and flourish as stylistic or social
variants. To ignore them, and to pursue the Holy Grail of a
grammar that describes "all and only" the grammatical strings
of a language, would be to defeat the enterprise of
broad-coverage computational parsing.
Furthermore, it is not necessary to enforce all of the
supposedly "grammatical" restrictions within a computational
analysis grammar that actually deals with quantities of real
text, in real time. Our experience with PEG, in the CRITIQUE
application, proves this. PEG produces appropriate parses for
(4) - (7). Then a Style component can comment on the parses,
calling attention to whatever problems or variations exist. We
do not currently handle all of the difficulties posed by (1) -
(3), but we do handle some of them. For those grammatical
restrictions that have to be enforced within the syntactic
grammar (such as number agreement), we have a two-pass error
detection and correction strategy. For massive problems like
the run-ons in (2), we use the technique of the "fitted parse,"
which tries to identify sensible chunks of text and present them
in some reasonable framework.
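The text says only that the fitted parse identifies sensible chunks
and presents them in some reasonable framework; it does not give the
procedure. The following is therefore a sketch under assumptions, not
PEG's algorithm: it uses a greedy longest-match strategy, and the
names `fitted_parse`, `toy_parses`, and `KNOWN_PHRASES` are
hypothetical. The `parses` argument stands in for a real parser that
reports whether a span receives a complete analysis.

```python
from typing import Callable

Chunk = tuple[str, list[str]]


def fitted_parse(
    tokens: list[str], parses: Callable[[list[str]], bool]
) -> list[Chunk]:
    """Cover the whole input with labelled chunks, preferring long spans."""
    chunks: list[Chunk] = []
    i = 0
    while i < len(tokens):
        # Try the longest span starting at i first, then shrink it.
        for j in range(len(tokens), i, -1):
            if parses(tokens[i:j]):
                chunks.append(("PARSED", tokens[i:j]))
                i = j
                break
        else:
            # Nothing starting here parses: keep the token instead of
            # rejecting the whole input.
            chunks.append(("UNPARSED", [tokens[i]]))
            i += 1
    return chunks


# Toy stand-in for a parser: a span "parses" if it is a known phrase.
KNOWN_PHRASES = {("the", "farms"), ("decrease",), ("the", "average", "size")}


def toy_parses(span: list[str]) -> bool:
    return tuple(span) in KNOWN_PHRASES


result = fitted_parse(
    ["the", "farms", "decrease", "zzz", "the", "average", "size"], toy_parses
)
```

Every token ends up in exactly one chunk, so the output always covers
the full input; a later component could then comment on the
`UNPARSED` pieces.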
Since it is neither desirable nor necessary for a
computational grammar to define "all and only" the "grammatical"
sequences of a language, and since working computational
grammars are the most comprehensive descriptions that we
can come up with, right now, for natural languages, we suggest
that the goal of real-world grammatical analysis be re-defined:
a grammar should try to describe "all," but not "only," the
grammatical strings of a language.
