SPECLkLIZEI) \[NFOKMATION EXTRACTION: AUTOMATIC CHEMICAL RFACTION CODING FROM ENGLIS}t DF.SCR£?'r\[ONS 
Larr,, H. Reeker* Elena M. 7.~ lura** and Paul E. Blower 
Chemical Abstracts Service 
2540 Olentangy River Road 
P.O. Box 3012 
Columbus, Ohio 43210 
ABSTRACT 
In an age of increased attention to the 
problems of database o rganlzat ion, retrieval 
problems and query languages, one of the major 
economic problems of many potential databases 
remains the entry of the original information into 
the database. Specialized information extraction 
(SIE) systems are therefore of potential impor- 
tance in the entry of information that is already 
available in certain restricted types of natural 
language text. This paper contains a discussion 
of the problems of enElneering such systems and a 
description of a particular SIE system, designed 
to extract information regarding chemical reac- 
tions from experimental sections of papers in the 
chemical literature and to produce a data struc- 
ture containing the relevant information. 
I. INTRODUCTION 
A. Overview of the Paper 
In an age of increased attention to the 
problems of database organization retrieval 
problems and query languages, one of the major 
economic problems of many potential databases 
remains the entry of the original information into 
the database. A large amount of such information 
is currently available in natural language text, 
and some of that text is of a higi~ly stylized 
nature, with a restricted semantic domain. It is 
the task of specialized information extract ion 
(SIE) systems to obtain information automatically 
from such texts and place it in the database. As 
with any system, it is desirable to minimize 
errors and human intervention, bur a total absence 
of either is not necessary for the system to be 
economically viabl~. 
* Current address: Department of Computer 
Science, Tulane University, New OrleanS, Louisiana 
70118, 
** Current address: P.O. Box 3554, Gaithersburg, 
Maryland 20278. 
109 
In ~his paper, we will first discuss some 
general characterlstics of SIE systems, then 
describe the development of an experimental system 
to assist in the coastructlon of a database of 
chemical reaction information. Many Journals, 
such as the Journal of Or~anlc Chemistry, have 
separate experimental sections, in which ~he 
procedures for preparing chemical compounds are 
described. It is desired to extract certain 
information about these reactions and place it in 
the database. A reaction information form (RIF) 
was developed in another project to contain the 
desired information. The purpose of the system is 
tO eliminate the necessity in a majority of 
cases, for a trained reader to read the text and 
enter the P.IF information into the machine. 
B. Some Terminology 
In the discussion below, we shall use the 
term ~rammmr to mean a system conslstin 8 of a 
lexicon s a s~ntax, a meanin~ representation lan- 
guage, and a z~ntlc mapping. The lexicon 
consists of the ~st of words in the language and 
one or more grammatical categories for each word. 
The syntax specifies the structure of sentences in 
the language in terms of the grammar ical 
categories. Morphological procedures may specify 
a "syntax" within classes of words and thereby 
reduce the size of the lexicon. A discourse 
structure, or extrasentential syntax, may also be 
included. 
The semantic mapping provides ~oc each 
syntactically correct sentence a meaning repre- 
sentation in the meaning representation language, 
and it is the crux of the whole system. ~f the 
semantic mapping is fundamentally straightforward, 
then the syntactic processing can often be 
reduced, as well. This is one of the virtues o~ 
SIt systems: Because Of the specialized subject 
matter, one can simplify syntactic process\[n~ 
through the use of ad hoc procedures (either 
algorithmic or heuristic). In many ca~es, t:he 
knowledge that allows this is nonlinguist ic 
knowledge, which may be encoded tn frames. 
Although this is not always the sense in which 
"frame" is used, this is the sense in which we 
shall use the term in our discussion below: 
Frames encode nonltngulstic "expectations" brouRht 
to bear on the task. ~n this light, it is inter- 
esting to ~xplore the subject of case-slot Iden- 
tity, as raised by Charnlak (\[981). if the slots 
are components of framesmand cases are names for 
arguments of a predicate, then the slots in any 
practical language understanding system may not 
correspond exactly co the cases in a language. \[n 
fact. the predicates may not correspond to the 
frames. On the other hand, if the language is 
capable of expressing all of the dlstinctio~m that 
can be understood in terms of the frames, one 
would expect them to grow closer and closer as the 
system became less specialized. The decision as 
to whether to maintain the distinction between 
predlcat e/case and frame/slot has a "Whor flan" 
flavor to it. We have chosen to maintain that 
dis c Inct ion. 
Despite the general decision with regards 
to predicates and slots, some of the grammatical 
categories in our work do not correspond precisely 
to conventional grammatical categories, but are 
specialized for the reaction information project. 
An example is "chemical name", This illustrates 
another reason that SIE systems are more practical 
than more general language understanding systems: 
One can use certain ad hoc categories based upon 
the characteristics of the problem (and of the 
underlying meanings represented). This idea was 
advocated several years ago by Thompson (\[966) and 
used in the design of a specialized database query 
system ( DFACON). Its problem in more general 
language processing appllcac ions - that the 
categories may not extend readily from one domain 
to another and may actually complicate the general 
grammar - does not cause as much difficulty in the 
SIE case. The danger of using ad hoc categories 
is, of course, that one can lose extensibility, 
and must make careful decisions in advance as to 
how specialized the SIE system is going to be. 
II. SPECIALIZED INFORMATION EXTRACTION 
A. Characteristics of the SIr Task 
The term "specialized information extrac- 
tion" is necessarily a relative one. Information 
extraction can range from the simplest sorts of 
tasks like obtaining all names of people men- 
Cloned in newspaper articles, to a full under- 
standing of relatively free text. The simplest of 
these require of the program little linguistic or 
empirical knowledge, while the most complex 
require more knowledge than we know how to lye. 
But when we refer to an SIE task, we will mean one 
that." 
(l) Deals with a restricted subject 
matter 
(2) Requires Information chat can be 
classified under a limited number of 
discrete parameters, and 
(3) Deals with language of a specialized 
type, usually narrative reports. 
SIE programs are more feasible than 
automatic translation because the restrictions 
lessen the ambiguity problems. This is even true 
in comparison to other tasks with a restricted 
subject matter, such as natural language computer 
programmln 8 or database query. Furthermore, these 
latter tasks require a very low error rate in 
order co be useful, because users will not 
tolerate either ineorrect results or constant 
queries and requeSts for rewording from the 
program, while SIr programs would be successful if 
they produced results in, say, 80% of cases and 
required that the information extraction be done 
by humans In the ochers. Even small rat:es of 
undetected errors would be tolerable in many 
sicuatlons, though one would wish co minimize 
them 
The lessened syntactic variety in SIE 
tasks means that the amount of syntactic analysis 
needed is lessened, and also the complexity of the 
machinery for the semantic mapping. At the same 
time, the specialized semantic domain allows the 
use of empirical knowledge to increase the 
efficiency and effectiveness of analysis proce- 
dures (the lessening of ambiguity being only one 
aspect of this). 
The particular cases of SIr that we have 
chosen are highly structured paragraphs, describ- 
ing laboratory procedures for synthesizing organic 
substances which were taken from =he experimental 
section of articles in J. Or\[. Chem. Our feeling 
is that the full text of chemical articles is 
beyond the state of the SIr art, if one wants to 
extract anything more than trivial information; 
hut the limited universe of discourse of the 
experimental paragraphs renders SIr on them 
fens i b le. 
llO 
B. The En~lneerins of SIE STsteaw 
Since the days of the early mechanical 
translation efforts, the amount of study of 
natural language phenomena, both from the point of 
view of pure theory and of determining specific 
facts about languages, has been substantial. 
Similarly, techniques for dealing with languages 
and other sorts of complex information by computer 
have been considerably extended and the work has 
been facilitated by the provision of higher-level 
programming languages and by the availability of 
faster machines and increased storage. Never-- 
chelsea, the state of scie~eific knowledge of 
language and of processes for utilizing that 
knowledge is still such that it is necessary Co 
take an "engineering approach" to the design of 
comp~attoual linguistics systems. 
In using the term "engineering", we mean 
to indicate that comprouises have to be made in 
the design of the system between what is theoreti- 
cally desirable, and what is feasible at the state 
of the art. Failing to have a complete grammar of 
the language over which one wishes to have STE, 
one uses heuristics to determine features that one 
wants. At the same time, one uses the scientific 
knowledge available, insofar as that is feasible. 
One builds and tests model or pilot system to 
explore problems and techniques and tries co 
extrapolate the experience to production systems, 
which themselves are likely ~o have to be 
"incrementally developed". 
In any engineering rout ext, evaluation 
measures are important. These measures allow one 
Co set criteria for acceptability of designs which 
are likely always to be imperfect, and to compare 
alternative systemS. The ultimate evaluation 
~easure on which management decisions rest is 
usually cosc/benefi~ ratio. This can be decer- 
mined only after examining the h,~an alternatives 
and their effectiveness. It is important to be 
able Co quantify these alternatives, and this is 
often not done. For instance, it is common to 
assume chat an automaclc system should not produce 
errors, whereas humans always do; so the percent- 
age of errors should be determined experimentally 
in each case and compared. 
For the evaluation of SIE systems, we 
would like to propose three measures; 
(1) Robustness - the percentage of 
inputs handled. Most real SIE sys- 
cm will reject certain inputs, so 
the rob~tness ~rlll be one minus the 
parc4n~ags rejected. 
(2) Accuracy the percentage of 
those inputs handled which are cor- 
rectly handled. 
(3) Error rate - the percentage of 
erronao~ entries within incorrectly 
an handled input. 
Probably the most difficult aspect of SIE 
enginearinE is the provision of a safety factor - 
an ability of the system to recognize inputs that 
it cannot handle. It is clear that one can create 
a system that is robust and acceptably accurate 
which has unacceptable error rates for certain 
inputs. If the system is to be useful, it must be 
possible aueOmaeicall~ to determine which docu- 
ments contain unacceptable error rates. It does 
no good to determine this manually, since chac 
~ould nan assentially redoing all of the infor- 
: scion extraction m~nual 1 y, and the space of 
'doc, ments Is not sufficiently uniform or con- 
tinuous thaC sampling methods would do any good. 
It appears, then, that the only way that one is 
going to be able co provide a safety factor is to 
have a system chat understands enough about the 
linguistic and nonllngulstlc aspects of the texts 
to know when it is not understanding (at least 
most of the time). We shall have more co say 
about the safety factor when we discuss our system 
below. 
One suggestion often made for "int el- 
l Igenc" systems is thac they be given some 
provision for improving their performance by 
"learning". Generally the problem with chls 
suggestion is chat the complexity of the learning 
process is greater than chac of the original 
system, and it is also unclear in many cases what 
the machine needs to learn. It nevertheless seems 
feasible for SIE systems to learn by Interaction 
with people who are dolng information extraction 
tasks. The simplest case of this would be .~u 8- 
mentinK the lexicon, but ochers should be pos- 
sible. The first step in chls process would be co 
III 
build in a sufficient safety factor that most 
incorrectly handled doc~anents can be explicitly 
rejected. The second would be Co localize the 
factors that caused the rejection sufficiently to 
be able to ask for help from the person doing the 
manual extraction process. Although we have 
considered this aspect of SIE development, we have 
oct made any attempt to implement it. 
A. The Description of ChmLtcal Reactions 
A particular task that would appear to be 
a candidate for STY, under the criteria given 
above, ls the extraction of information on chemi- 
cal reactions from experimental sections of chemi- 
cal Journals. The Journal chosen for our 
experimental work was the Journal of Or~anlc 
Chemistry,. Two examples of reaction descriptions 
from this Journal are shown in Figure 1. Both of 
these examples have a particular type of discourse 
structure, which we have called the "simple 
model". The paragraphs in the figure (hut not in 
the actual texts) are divided into four com- 
ponents: a heading, a synthesis, a work-upm and a 
characterization. Usually, the heading names the 
substance that is produced in the reaction, the 
synthesis porclon describes the steps followed in 
conducting the reaction, the work-up poL~lon 
describes the recovery of the substance from the 
reaction mixture, and the characterization portion 
presents analytical data suppoL~Ing the structure 
assignment. Most of the information that we wish 
to obtain Is in the synthesis port l on, which 
describes the chemical reactants, reaction con- 
dltlons and apparatus. 
Figure 2 shows the Reaction Information 
Form (R/F) designed to hold the required reaction 
information, with information supplied for the two 
paragraphs illustrated in Figure 1. One point to 
notice is that not every piece of data is con- 
tained in every reaction description. Thus there 
are blanks Ln both examples, corresponding to 
\[nformar~.~,~ l~Ct u~speclfled in the corresponding 
r'~.rt~.~* des~:rlptions (those shown in Figure L). 
B. An ~I~ S~5t~n for Reaction Information 
\[. General Or~anlzation 
The chemical reaction SIE is written in 
PL/I and runs on a 370/168 under TSO. The t~stlng 
of certain of the algorithms and heuristics has 
been done using SNOBOLA (SPITBOL) running under 
UNIX on a POP LI/70. The choice of PL/I on the 
370 was dictated by practical considerations 
involving the availability of textual m~Cerl~l, 
the unusual format of that material, and the 
availability of existing PL/I routines to deal 
with that format, 
The prosraum comprising each stage of the 
system are implemented modularly. Thus the lexi- 
cal stage involves separate passes for individual 
lexical categories. In some cases, these are not 
order-lndependent. In the syntactic phase, the 
individual modules are "word experts", and in the 
last (extraction) phase, they are individual 
"frames" or components of frames, 
2. The Lexical StaGe 
In the lexical stage, both dictionary 
lookup and morphological analysis are used to 
classify words. Morphological analysis procedures 
include suf fix normalization, stemming and root 
word lookup and analysis of internal p,mc~itation. 
Chemical substances may be identified by complex 
words and phrases, and are therefore surprisingly 
difficult to isolate& 
Both lexical and syntactic means are used 
to isolate and tag chemical names. In the lexlcal 
stage, identifiable chemical roots, such as "benz" 
and terms, such as "Iso-" are tagged. In the 
syntactic stage, a procedure uses clues such as 
parenthetical expressions, internal commas and the 
occurrence of Juxtaposed chemical roots to iden- 
tify chemical names. This is really morphology, 
of course. It also uses the overall syntax of the 
sentence to check whether a substance name is 
expected and to dellmlt the chemical name. 
3. The Syntactic Stase 
Chemical substances which comprise the 
reactants and the products of a chemical reaction, 
as well as the reaction conditions and yield, are 
identified by a hierarchical application of proce- 
dures. The syntactic stage of the system has been 
implemented by application of word expert proce- 
dures to the data structures built durittg the 
lex~cal stage. 
The word experts are based upon the !~s 
of Rieger and Small (1979) but It has not h 
found to be necessary to ';~? the full complexity 
of their model, so this system's word expels have 
LI2 
N-l-Mot kyl-|,6-dlkydM- ! ,4:44,10b-dletkeaebonso(/1- 
I)Mkabudae.~t( 1H,4 H)<UearbatmJ~ (~a). 
A ~utkm d 
h ia s ¢1 p~o~me/~thyl rote mlzmu~ (I00 mL) mo~l la s 
dry im/~-Wopm~ b~ ,m u~d ~ m~ • sd~on d 
N-methy~ (1.24 8. 11.0 retool) in ethyl into (2O 
aLL). "rbe mect~a nhxun'e ~m a~usd to wenn to n~m t4m. 
perstu/ 
~d the im~pitatad sdduc~ ~ co~cu~d to ~ 2.70 
| (7"t%) o( utmmie 7a m e light piak redid, mp t93-194 "C. 
m,d~mi mmpb ~m ob~med in colm4m ~ by ~ 
taU~m b~m ~mm~/cyeob,m~ mp 19~19~q ~C; O~ (liar) 
• ~ 3100-2820. 1780. 1710. 1460. 1400. 1220. 1200. 790 taxi 780 
em-c. 'H ~ (~ ~ 7.8-7.0 (m. 4 H), (L4-(L15 (m, 2 H). &99 
(~ j i, 3 Hs, 1 H). 5.73 (d. J ', 3 1~. I H)0 &62-6.~ (m. I H). 
4.83-4.60 (ns. t H). 3.0-2.6 (m. 2 H). 2.90 (~ 3 H). 2.2-1.6 (m. 2 
H); mare q~wum, adai m/e 319.1320. obed 319.1324. 
Am~ ~ for C.bI.N~O~ C, 7L~ It 5.37; N. ~&la Fo.m~ 
C, 7Ll~ 14, ~1; N. 12.85. 
Figure 1. 
model. 
S Illt.l I~O!1. 
!. ~lnl 
1. SyI~II~III I I 
4. Chil~lC tel" t IIIt Oi~ 
reaction descrlpcions, divided Co show 
N-|. Metkyi-$,|-dikydro-7,10-d|met byi- 1.4:4t, 10b.di. 
e~m~mum( t~ktha/mda~:~K 1H~ H)<lJ~zlms~ld~ (Tb). 
To Ib (:1.42 I, 10.4 m¢4) dlmoNed in ~0 ~ o( cold (-'78 *C) 4:1 
pentene/e~byl ~e~te wu added dropwise ~-methyl- 
~{~moU~U~m (1.1T I. 10,4 mm~) dimolv~d in sthyl rote (14 
mL). A4~41t t~ I~t~o~ mill ~m~lbed, t~e m~-t4o, mu\[l~m ~ml 
The dilhtly pb:d( ml/d wla 
~/ta ~ mp 
~O.-2=2 "C; lit O(~r) ~ 3100-2~00.1770,170~. 1460. I~SO. 1380. 
1190, MO, S00, 78~ 740. 606 ~-'; ~H NMR (CDCI~) A 6.97 (,. 2 
H). &4-qLI (~ 3 M), 6.12 (d. J m 2.9 H~ 1 H), 5.10-4.75 (m, 2 
H), 3.10-.:L?s (m. 2 H). Z~ (s. 3 H), 2.,5-1.9 (m. 2 H), 2.36 (s. 3 
\]HI). I~3 ~ 3 H); :C ~ (CDC~) IM.8 (a), 1~8.4 (s), 143.4 (d), 
lmA (d), 137.? (,). 13¢1 (,), 13~6 (,). I~LI (0), 129.1 (d), 128.9 
(d). 1~7 (d). 128.2 (d). 60.'7 (2C. 2 d). 50.8 (s). ~.8 (e). 28.5 (t). 
~.l (O, 21L3 (q). 2~l~ (qj, gO.3 ppm (q); m~m ~pecu~m, adcd m / e 347.1434, obxl S4?.le42. 
AmL Cdalf~C~HuN~0~: C,72.~,FL &m. F~m~ C,72.71: 
componem:s of the s~mple 
REF SCALE O&rl, I smlll 
7XME EN(RGY 
COOling 
'R~G NO FUNCTION 
78~24"G2-1 OrOOuCt 
7862~-GI'0 r~ac¢&P¢ 
f32T4-43-6 P~iccan¢ 
SO,vent 
SOlven~ 
... ! 
PHASE 
IOltd 
APPARATUS 
&MT 
2.70 O 
1.2A g 
80 mt 40 mL 
vXELD TEM~ 
77 :; -78 to 20 
FEATURES 
XR. NMR. MS 
AUTHOR l0 
T8 6a 
N-me~ylfPilzoltrleoto~e De,cane 
ethyl ~cl~a~o 
REF 
TZME 
REG NO 
78624-62-I 
7862J-6~-0 
13274-43-6 
StAkE $malI 
EN(RGY 
FUNCTION 
~rodu¢~ 
Solvent SOlvent 
APPARATUS FEATURES 
NMR. IQ. ~S 
AMT AUTHOR !~ 
2.38 g 7~ 
2.~2 9 6~ 
t.1? 9 N-me~yl~r.azO!~nio~ont 40 mL Den~ane 
2J mL e~n~l ace~a:a 
I 
F£gure 2. Two reaction £nformation {otis, produced (manually) £rom {:he descrtp- 
Cions of Figure i. 
i13 
turned out to resemble a standard procedural 
implementation (Wlnograd, 1971) (based mostly on 
particular words or word categories, however). 
Their function is to determine the role of a word 
taking lexical and syntactic context into con- 
sideration. The word expeL'l: approach was ini- 
tially chosen because it enables the implemen- 
tation of fragments of a grammar and does not 
require the development of a comprehensive gram- 
mar. Since irrelevant portions can be identified 
by reliable heuristics and eliminated, this 
attribute is partlcularly useful in the SIE con- 
text. The procedures also allow the incorporation 
of heuristics for isolat Ing cer~aln items of 
interest. 
In this context, it might be maintained 
that the interface between the syntax and the 
semantic mapping is even less clean than in cer- 
tain other systems. This is intentional. BecauSe 
of the specialized nature of the process, we have 
implemented the "semantic counterpar~ of syntax" 
concept, as advocated by Thompson (1966), where we 
judged that it would not impair the generality of 
the system within the area of reaction descrip- 
tions. We have tried not to make decisions that 
would make it difficult to extend the system to 
descriptions of reactions that do not obey the 
"simple model". The advantages of this approach 
were discused in Section I. 
The system pays particular attention to verb 
arguments, which are generally marked by 
prepositions This "case" type analysis gives 
pretty good direct clues to the function of items 
within the meaning representation. Sentencu 
structure is relatively regular, though extraposed 
phrases and a few types of clauses must be dealt 
with. Fortunately, the results, in terms of 
function of chemicals and reaction conditions, are 
the same whether the verb form is in an embedded 
clause or the ~ain verb of the sentence. Zn other 
words, we do not have to deal with the nuances 
implied by higher predicates, or with implicative 
verbs, presuppositions, and the llke. 
4. The Semantic Sta~e 
The semantic mapping could be directly to 
the components of the reaction information form, 
and that is the approach that was implemented in 
the first programs. This gave reasonable results 
in some test cases, but appeared co be less exten- 
slble to other models of reaction description than 
IL4 
was desirable. A SNOBOL4 version maps the syntax 
to a predicate-arg,.ment formalism, with a case 
frame for each verb designating the posslbte 
~rguments for each predicate. 
5, The Extraction Sta~e 
The meaning representation gives a pretty 
clear indication of the function of items within 
the RIF in the simple model. Since we wadted to 
experiment wlth generality in this system, we 
wished to separate general knowledge from linguis- 
tic knowledge, and for that reason, the actual 
extraction of items is done using the frame tech- 
nique (Minaky, 1975; Charniak, 1975). 
In the literature, frames and similar 
devices vary both in their format and in their 
function. Tn some cases, the information that 
they encode is still linguistic, at least in part. 
We are using them in the "nonllngulstlc" sense, as 
discussed in Section I. ~n our system, frames 
encode the expectations that a trained reader 
would brin E to the task of extracting information 
from synthetic descriptions, involving the usual 
structure of these descriptions. 
A frame is being developed initially for 
the simple model. This frame looks for the syn- 
thesis section, dlsc~ rd ing work-up and charac- 
r ,~ '.? ': j~l,~,\[ ~. -} ~;',~ .j:jv~.-,.I~j. ~.L r.hen focuses on 
the synthesis, whe -~' subframes correspond to the 
particular entrle~ :~eeded in the RIF. 
As one example, the "time" frame expects 
to find a series of re~=tlon step times in the 
description. These are already labelled "time", 
and the frame will know that it has to total them. 
making approximations of such time expressions as 
"overnight" and indicating that the total tS then 
approximate. Another example is the "temperature" 
frame, which expects a series of temperatures, and 
must calculate the minimum and maximum, in order 
to specify a range. Again, a certain amount of 
specialized knowledge, such as the temperature 
indicated by an ice water bat~, is necessary. 
C. Evaluation of the S~s~em 
As of the date of this paper, we have only 
experimented with the version of the system that 
maps directly from the syntax into componu.: ~ 
the reaction coding form. As noted above, this 
version does not have the generality that we 
desire, but gives a pretty good indication of the 
capabilities of the system, as now Implemented. 
Am a test of the system, we ran it on 
fifty synthetic paragraphs from the experimental 
sections of the Journal of 0rsanic Ch,~stry, and 
thirty-six were processed satisfactorily. Four 
had clear, detectable problems, so the robustness 
was 92%, but the accuracy was only 78%, since ten 
of the paragraphs did not follow the simple model, 
and were nevertheless processed. Since these were 
full of errors, we did not try to compute a figure 
for average error rate. 
Although the objective of building this 
experimental system was only to deal with the 
simple model, the exercise has made clear to us 
the importance of the safety factor in making a 
system such as thls useful. We intend to continua 
work with the present system only for a few weekS, 
meanwhile considering the problems and promises of 
extending it. 
fall within chi.~ paradigm include one constructed 
by the Operating Systems Division of Logicon 
(Silva, Montgomery and Dwiggins, 1979), which aims 
tO "atodel the cognitive activities of the htanan 
analyst as he reads and understands message text, 
distilling its contents Into information items of 
internst to him, and building a conceptual model 
of the lnformgtion conveyed by the meBsase," In 
the area of missile and satellite reports and 
aircraft activittu. Another project, at Rutgers 
University, Involva the analysis of case descrip- 
tions concerning glaucoma patients (Ci esi elski, 
1979), and the most extensive SIE project, also in 
the medical area, is that of the group headed by 
Naom£ Sager (1981) at New York University, and 
described in her book. 
IV. RELATION TO SOME OTHER SIE SYSTEMS 
The problem chat we have had concerning 
the safety factor is one chat is likely to be 
found in any $IE system, but i¢ is soluble we 
feel. Even though we have not completed work on 
this experimental system as of the time of writing 
this paper (we have found more syntactic and 
semantic procedures ro be implemented), we already 
have ideas as ¢o how to build in a better safety 
factor. Generally, these can be characterized as 
using some of the information chat can be gleaned 
by a comblnat ion of llnguls tic and chemical 
knowledge which we had ignored as redundant. 
While It is redundant in "successful" cases, it 
produces conflicts tn other cases, indicating that 
something is wrong, and that the document should 
be processed b? hand. 
If the safety Eactor can be improved, SIE 
systems offer a promising area of application of 
computational £ingulst tcs rechnl ques. Clear\[?, 
nothing less than computational linguistics tech- 
niques show any hope of providing a reasonable 
safety factor -or ever adequare robustness and 
accuracy, 
The promise of the SIE area has been 
recognized by other researchers. Systems that 
ItS 

REFERENCES 

Charnlak E. (1975), Organization and Inference 
in a Framel£ke System of Common Sense Knowledge. 
In R. C. Schank and B. L. Naah-Webber, e4s., 
Theoretical Issues in Natural Lan.La._.~u.a~e.Proces- 
sln~, Mathe~aatlcal Social Sciences Board. 

Charnlak, E. (1981). The Case-Slot Eden\[ ity 
Theory, Co~nltlve Science, 5, 28~-292. 

Ciesielski, V. B. (1979). Natural Language Input 
to a Computer-Based Glaucoma Consultation System, In Proceedings, 17th Annual Meeting of the Association for Computational Linguistics pp. 
103-107. 

Minsky, M. (1975). A Framework for Representing 
Knowledge. In P. Winston, ed., The Psychology 
of Computer vision, McGraw-Hill, New York. 

Rieger, C., and S. Small (1979). Word Expert 
Parslng. In Proceedings. Sixth International 
Joint Conference on Artificial Intelli~ence , 
Tokyo, 1979. 

Sager, N, (1981). Natural Language Information 
Processing, Addlson-Wesley, Reading, Mas- 
sachusetts. 

Silva, G., C. Montgomery and D. Dwiggins (1979). 
An Application of Automated Language Understanding Techniques to the Generation of Data Base 
Elements, In Proceedings 17th Annual Meeting of 
the Association for Computational Linguistics, 
pp. 95-97. 

Thompson, F. B. (1966). English for the Computer. 
In Proceedings, Fall Joint Computer Conference, 
Spartan Books, Washington, D. C. 

Winograd, T. (1972). Understandln~ Natural 
Language, Academic Press, New York. 

Zamora E. M., and L. 8. Reeker (1982a). Com- 
putational Linguistics Research Project (Automatic 
Reaction Coding From English Descriptions), Lexi- 
cal Phase (Tasks i, ~.I, 2.2, 2.3, 2.4), Chemical 
Abstracts Service, March. 1982. 

Zamora, E. M., and L. 8. Reeker (1982b). Com- 
putational Linguistics Research Project (Automatic 
R~actton Coding From English Descriptions), Syn- 
tactic Phase (Tasks 2.5, 2.6), Chemical Abstracts 
Service, July, 1982. 
