<?xml version="1.0" standalone="yes"?> <Paper uid="A83-1019"> <Title>SPECIALIZED INFORMATION EXTRACTION: AUTOMATIC CHEMICAL REACTION CODING FROM ENGLISH DESCRIPTIONS</Title> <Section position="3" start_page="0" end_page="109" type="metho"> <SectionTitle> I. INTRODUCTION </SectionTitle> <Paragraph position="0"> A. Overview of the Paper In an age of increased attention to the problems of database organization, retrieval problems, and query languages, one of the major economic problems of many potential databases remains the entry of the original information into the database. A large amount of such information is currently available in natural language text, and some of that text is of a highly stylized nature, with a restricted semantic domain. It is the task of specialized information extraction (SIE) systems to obtain information automatically from such texts and place it in the database. As with any system, it is desirable to minimize errors and human intervention, but a total absence of either is not necessary for the system to be economically viable.</Paragraph> <Paragraph position="1"> * Current address: Department of Computer Science, Tulane University, New Orleans, Louisiana 70118. ** Current address: P.O. Box 3554, Gaithersburg, Maryland 20278.</Paragraph> <Paragraph position="2"> In this paper, we will first discuss some general characteristics of SIE systems, then describe the development of an experimental system to assist in the construction of a database of chemical reaction information. Many journals, such as the Journal of Organic Chemistry, have separate experimental sections, in which the procedures for preparing chemical compounds are described. It is desired to extract certain information about these reactions and place it in the database. A reaction information form (RIF) was developed in another project to contain the desired information. 
The purpose of the system is to eliminate the necessity, in a majority of cases, for a trained reader to read the text and enter the RIF information into the machine.</Paragraph> <Paragraph position="3"> B. Some Terminology In the discussion below, we shall use the term grammar to mean a system consisting of a lexicon, a syntax, a meaning representation language, and a semantic mapping. The lexicon consists of the list of words in the language and one or more grammatical categories for each word. The syntax specifies the structure of sentences in the language in terms of the grammatical categories. Morphological procedures may specify a &quot;syntax&quot; within classes of words and thereby reduce the size of the lexicon. A discourse structure, or extrasentential syntax, may also be included.</Paragraph> <Paragraph position="4"> The semantic mapping provides for each syntactically correct sentence a meaning representation in the meaning representation language, and it is the crux of the whole system. If the semantic mapping is fundamentally straightforward, then the syntactic processing can often be reduced as well. This is one of the virtues of SIE systems: because of the specialized subject matter, one can simplify syntactic processing through the use of ad hoc procedures (either algorithmic or heuristic). In many cases, the knowledge that allows this is nonlinguistic knowledge, which may be encoded in frames.</Paragraph> <Paragraph position="5"> Although this is not always the sense in which &quot;frame&quot; is used, this is the sense in which we shall use the term in our discussion below: frames encode nonlinguistic &quot;expectations&quot; brought to bear on the task. In this light, it is interesting to explore the subject of case-slot identity, as raised by Charniak (1981). 
If the slots are components of frames and cases are names for arguments of a predicate, then the slots in any practical language understanding system may not correspond exactly to the cases in a language. In fact, the predicates may not correspond to the frames. On the other hand, if the language is capable of expressing all of the distinctions that can be understood in terms of the frames, one would expect them to grow closer and closer as the system became less specialized. The decision as to whether to maintain the distinction between predicate/case and frame/slot has a &quot;Whorfian&quot; flavor to it. We have chosen to maintain that distinction.</Paragraph> <Paragraph position="6"> Despite the general decision with regard to predicates and slots, some of the grammatical categories in our work do not correspond precisely to conventional grammatical categories, but are specialized for the reaction information project. An example is &quot;chemical name&quot;. This illustrates another reason that SIE systems are more practical than more general language understanding systems: one can use certain ad hoc categories based upon the characteristics of the problem (and of the underlying meanings represented). This idea was advocated several years ago by Thompson (1966) and used in the design of a specialized database query system (DEACON). Its problem in more general language processing applications - that the categories may not extend readily from one domain to another and may actually complicate the general grammar - does not cause as much difficulty in the SIE case. The danger of using ad hoc categories is, of course, that one can lose extensibility, and must make careful decisions in advance as to how specialized the SIE system is going to be.</Paragraph> <Paragraph position="7"> II. SPECIALIZED INFORMATION EXTRACTION A. Characteristics of the SIE Task The term &quot;specialized information extraction&quot; is necessarily a relative one. 
Information extraction can range from the simplest sorts of tasks, like obtaining all names of people mentioned in newspaper articles, to a full understanding of relatively free text. The simplest of these require of the program little linguistic or empirical knowledge, while the most complex require more knowledge than we know how to give.</Paragraph> <Paragraph position="8"> But when we refer to an SIE task, we will mean one that: (1) Deals with a restricted subject matter, (2) Requires information that can be classified under a limited number of discrete parameters, and (3) Deals with language of a specialized type, usually narrative reports.</Paragraph> <Paragraph position="9"> SIE programs are more feasible than automatic translation because the restrictions lessen the ambiguity problems. This is even true in comparison to other tasks with a restricted subject matter, such as natural language computer programming or database query. Furthermore, these latter tasks require a very low error rate in order to be useful, because users will not tolerate either incorrect results or constant queries and requests for rewording from the program, while SIE programs would be successful if they produced results in, say, 80% of cases and required that the information extraction be done by humans in the others. Even small rates of undetected errors would be tolerable in many situations, though one would wish to minimize them. The lessened syntactic variety in SIE tasks means that the amount of syntactic analysis needed is lessened, and also the complexity of the machinery for the semantic mapping. 
At the same time, the specialized semantic domain allows the use of empirical knowledge to increase the efficiency and effectiveness of analysis procedures (the lessening of ambiguity being only one aspect of this).</Paragraph> <Paragraph position="10"> The particular cases of SIE that we have chosen are highly structured paragraphs describing laboratory procedures for synthesizing organic substances, which were taken from the experimental section of articles in J. Org. Chem. Our feeling is that the full text of chemical articles is beyond the state of the SIE art, if one wants to extract anything more than trivial information; but the limited universe of discourse of the experimental paragraphs renders SIE on them feasible.</Paragraph> <Paragraph position="11"> B. The Engineering of SIE Systems Since the days of the early mechanical translation efforts, the amount of study of natural language phenomena, both from the point of view of pure theory and of determining specific facts about languages, has been substantial.</Paragraph> <Paragraph position="12"> Similarly, techniques for dealing with languages and other sorts of complex information by computer have been considerably extended, and the work has been facilitated by the provision of higher-level programming languages and by the availability of faster machines and increased storage. Nevertheless, the state of scientific knowledge of language and of processes for utilizing that knowledge is still such that it is necessary to take an &quot;engineering approach&quot; to the design of computational linguistics systems.</Paragraph> <Paragraph position="13"> In using the term &quot;engineering&quot;, we mean to indicate that compromises have to be made in the design of the system between what is theoretically desirable and what is feasible at the state of the art. Failing to have a complete grammar of the language over which one wishes to have SIE, one uses heuristics to determine features that one wants. 
At the same time, one uses the scientific knowledge available, insofar as that is feasible. One builds and tests a model or pilot system to explore problems and techniques and tries to extrapolate the experience to production systems, which themselves are likely to have to be &quot;incrementally developed&quot;.</Paragraph> <Paragraph position="14"> In any engineering context, evaluation measures are important. These measures allow one to set criteria for acceptability of designs which are likely always to be imperfect, and to compare alternative systems. The ultimate evaluation measure on which management decisions rest is usually cost/benefit ratio. This can be determined only after examining the human alternatives and their effectiveness. It is important to be able to quantify these alternatives, and this is often not done. For instance, it is common to assume that an automatic system should not produce errors, whereas humans always do; so the percentage of errors should be determined experimentally in each case and compared.</Paragraph> <Paragraph position="15"> For the evaluation of SIE systems, we would like to propose three measures: (1) Robustness - the percentage of inputs handled. Most real SIE systems will reject certain inputs, so the robustness will be one minus the percentage rejected.</Paragraph> <Paragraph position="16"> (2) Accuracy - the percentage of those inputs handled which are correctly handled.</Paragraph> <Paragraph position="17"> (3) Error rate - the percentage of erroneous entries within an incorrectly handled input.</Paragraph> <Paragraph position="18"> Probably the most difficult aspect of SIE engineering is the provision of a safety factor, an ability of the system to recognize inputs that it cannot handle. It is clear that one can create a system that is robust and acceptably accurate which has unacceptable error rates for certain inputs. 
If the system is to be useful, it must be possible automatically to determine which documents contain unacceptable error rates. It does no good to determine this manually, since that would mean essentially redoing all of the information extraction manually, and the space of documents is not sufficiently uniform or continuous that sampling methods would do any good. It appears, then, that the only way that one is going to be able to provide a safety factor is to have a system that understands enough about the linguistic and nonlinguistic aspects of the texts to know when it is not understanding (at least most of the time). We shall have more to say about the safety factor when we discuss our system below.</Paragraph> <Paragraph position="19"> One suggestion often made for &quot;intelligent&quot; systems is that they be given some provision for improving their performance by &quot;learning&quot;. Generally the problem with this suggestion is that the complexity of the learning process is greater than that of the original system, and it is also unclear in many cases what the machine needs to learn. It nevertheless seems feasible for SIE systems to learn by interaction with people who are doing information extraction tasks. The simplest case of this would be augmenting the lexicon, but others should be possible. The first step in this process would be to build in a sufficient safety factor that most incorrectly handled documents can be explicitly rejected. The second would be to localize the factors that caused the rejection sufficiently to be able to ask for help from the person doing the manual extraction process. Although we have considered this aspect of SIE development, we have not made any attempt to implement it.</Paragraph> <Paragraph position="20"> A. 
The Description of Chemical Reactions A particular task that would appear to be a candidate for SIE, under the criteria given above, is the extraction of information on chemical reactions from experimental sections of chemical journals. The journal chosen for our experimental work was the Journal of Organic Chemistry. Two examples of reaction descriptions from this journal are shown in Figure 1. Both of these examples have a particular type of discourse structure, which we have called the &quot;simple model&quot;. The paragraphs in the figure (but not in the actual texts) are divided into four components: a heading, a synthesis, a work-up, and a characterization. Usually, the heading names the substance that is produced in the reaction, the synthesis portion describes the steps followed in conducting the reaction, the work-up portion describes the recovery of the substance from the reaction mixture, and the characterization portion presents analytical data supporting the structure assignment. Most of the information that we wish to obtain is in the synthesis portion, which describes the chemical reactants, reaction conditions and apparatus.</Paragraph> <Paragraph position="21"> Figure 2 shows the Reaction Information Form (RIF) designed to hold the required reaction information, with information supplied for the two paragraphs illustrated in Figure 1. One point to notice is that not every piece of data is contained in every reaction description. Thus there are blanks in both examples, corresponding to information left unspecified in the corresponding reaction descriptions (those shown in Figure 1). B. An SIE System for Reaction Information 1. General Organization The chemical reaction SIE is written in PL/I and runs on a 370/168 under TSO. The testing of certain of the algorithms and heuristics has been done using SNOBOL4 (SPITBOL) running under UNIX on a PDP 11/70. 
The choice of PL/I on the 370 was dictated by practical considerations involving the availability of textual material, the unusual format of that material, and the availability of existing PL/I routines to deal with that format. The programs comprising each stage of the system are implemented modularly. Thus the lexical stage involves separate passes for individual lexical categories. In some cases, these are not order-independent. In the syntactic phase, the individual modules are &quot;word experts&quot;, and in the last (extraction) phase, they are individual &quot;frames&quot; or components of frames.</Paragraph> </Section> <Section position="4" start_page="109" end_page="109" type="metho"> <SectionTitle> 2. The Lexical Stage </SectionTitle> <Paragraph position="0"> In the lexical stage, both dictionary lookup and morphological analysis are used to classify words. Morphological analysis procedures include suffix normalization, stemming and root word lookup, and analysis of internal punctuation. Chemical substances may be identified by complex words and phrases, and are therefore surprisingly difficult to isolate. Both lexical and syntactic means are used to isolate and tag chemical names. In the lexical stage, identifiable chemical roots, such as &quot;benz&quot;, and terms, such as &quot;iso-&quot;, are tagged. In the syntactic stage, a procedure uses clues such as parenthetical expressions, internal commas and the occurrence of juxtaposed chemical roots to identify chemical names. This is really morphology, of course. It also uses the overall syntax of the sentence to check whether a substance name is expected and to delimit the chemical name.</Paragraph> </Section> <Section position="5" start_page="109" end_page="109" type="metho"> <SectionTitle> 3. 
The Syntactic Stage </SectionTitle> <Paragraph position="0"> Chemical substances which comprise the reactants and the products of a chemical reaction, as well as the reaction conditions and yield, are identified by a hierarchical application of procedures. The syntactic stage of the system has been implemented by application of word expert procedures to the data structures built during the lexical stage.</Paragraph> <Paragraph position="1"> The word experts are based upon the ideas of Rieger and Small (1979), but it has not been found to be necessary to use the full complexity of their model, so this system's word experts have turned out to resemble a standard procedural implementation (Winograd, 1971) (based mostly on particular words or word categories, however). [Figure 1. Reaction descriptions, divided to show components: 1. Heading, 2. Synthesis, 3. Work-up, 4. Characterization.]</Paragraph> <Paragraph position="2"> Their function is to determine the role of a word, taking lexical and syntactic context into consideration. The word expert approach was initially chosen because it enables the implementation of fragments of a grammar and does not require the development of a comprehensive grammar. Since irrelevant portions can be identified by reliable heuristics and eliminated, this attribute is particularly useful in the SIE context. 
The procedures also allow the incorporation of heuristics for isolating certain items of interest.</Paragraph> <Paragraph position="3"> In this context, it might be maintained that the interface between the syntax and the semantic mapping is even less clean than in certain other systems. This is intentional. Because of the specialized nature of the process, we have implemented the &quot;semantic counterpart of syntax&quot; concept, as advocated by Thompson (1966), where we judged that it would not impair the generality of the system within the area of reaction descriptions. We have tried not to make decisions that would make it difficult to extend the system to descriptions of reactions that do not obey the &quot;simple model&quot;. The advantages of this approach were discussed in Section I.</Paragraph> <Paragraph position="4"> The system pays particular attention to verb arguments, which are generally marked by prepositions. This &quot;case&quot; type analysis gives pretty good direct clues to the function of items within the meaning representation. Sentence structure is relatively regular, though extraposed phrases and a few types of clauses must be dealt with. Fortunately, the results, in terms of function of chemicals and reaction conditions, are the same whether the verb form is in an embedded clause or is the main verb of the sentence. In other words, we do not have to deal with the nuances implied by higher predicates, or with implicative verbs, presuppositions, and the like.</Paragraph> </Section> <Section position="6" start_page="109" end_page="109" type="metho"> <SectionTitle> 4. The Semantic Stage </SectionTitle> <Paragraph position="0"> The semantic mapping could be directly to the components of the reaction information form, and that is the approach that was implemented in the first programs. This gave reasonable results in some test cases, but appeared to be less extensible to other models of reaction description than was desirable. 
A SNOBOL4 version maps the syntax to a predicate-argument formalism, with a case frame for each verb designating the possible arguments for each predicate.</Paragraph> </Section> <Section position="7" start_page="109" end_page="109" type="metho"> <SectionTitle> 5. The Extraction Stage </SectionTitle> <Paragraph position="0"> The meaning representation gives a pretty clear indication of the function of items within the RIF in the simple model. Since we wanted to experiment with generality in this system, we wished to separate general knowledge from linguistic knowledge, and for that reason, the actual extraction of items is done using the frame technique (Minsky, 1975; Charniak, 1975).</Paragraph> <Paragraph position="1"> In the literature, frames and similar devices vary both in their format and in their function. In some cases, the information that they encode is still linguistic, at least in part.</Paragraph> <Paragraph position="2"> We are using them in the &quot;nonlinguistic&quot; sense, as discussed in Section I. In our system, frames encode the expectations that a trained reader would bring to the task of extracting information from synthetic descriptions, involving the usual structure of these descriptions.</Paragraph> <Paragraph position="3"> A frame is being developed initially for the simple model. This frame looks for the synthesis section, discarding work-up and characterization. It then focuses on the synthesis, where subframes correspond to the particular entries needed in the RIF.</Paragraph> <Paragraph position="4"> As one example, the &quot;time&quot; frame expects to find a series of reaction step times in the description. These are already labelled &quot;time&quot;, and the frame will know that it has to total them, making approximations of such time expressions as &quot;overnight&quot; and indicating that the total is then approximate.</Paragraph> <Paragraph position="5"> 
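A minimal Python sketch of the totalling behaviour described for the &quot;time&quot; frame follows. It is an illustration only, not the PL/I or SNOBOL4 implementation; the names total_reaction_time, TotalTime and APPROXIMATE_HOURS are hypothetical, and the 16-hour estimate for &quot;overnight&quot; is an assumed value that the paper does not give.

```python
from dataclasses import dataclass

# Hypothetical estimates (in hours) for vague time expressions.
# The 16-hour value is an assumption made for this sketch only.
APPROXIMATE_HOURS = {"overnight": 16.0}


@dataclass
class TotalTime:
    hours: float
    approximate: bool  # True when any step time was estimated


def total_reaction_time(step_times):
    """Total a series of step times given as hours or vague expressions."""
    total = 0.0
    approximate = False
    for item in step_times:
        if isinstance(item, str):
            # Vague expression: substitute an estimate and flag the total.
            # An unknown expression raises KeyError rather than being guessed.
            total += APPROXIMATE_HOURS[item.lower()]
            approximate = True
        else:
            total += float(item)
    return TotalTime(hours=total, approximate=approximate)


print(total_reaction_time([2, 0.5, "overnight"]))
```

Carrying the approximate flag alongside the total is one way the resulting RIF entry could be marked as an estimate; failing on an unrecognized expression, instead of guessing, is in the spirit of the safety factor discussed in Section II.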
Another example is the &quot;temperature&quot; frame, which expects a series of temperatures, and must calculate the minimum and maximum, in order to specify a range. Again, a certain amount of specialized knowledge, such as the temperature indicated by an ice water bath, is necessary.</Paragraph> <Paragraph position="6"> C. Evaluation of the System As of the date of this paper, we have only experimented with the version of the system that maps directly from the syntax into components of the reaction coding form. As noted above, this version does not have the generality that we desire, but gives a pretty good indication of the capabilities of the system, as now implemented.</Paragraph> <Paragraph position="7"> As a test of the system, we ran it on fifty synthetic paragraphs from the experimental sections of the Journal of Organic Chemistry, and thirty-six were processed satisfactorily. Four had clear, detectable problems, so the robustness was 92%, but the accuracy was only 78%, since ten of the paragraphs did not follow the simple model, and were nevertheless processed. Since these were full of errors, we did not try to compute a figure for average error rate.</Paragraph> <Paragraph position="8"> Although the objective of building this experimental system was only to deal with the simple model, the exercise has made clear to us the importance of the safety factor in making a system such as this useful. 
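The reported robustness and accuracy can be rechecked with a short Python sketch of the first two measures proposed in Section II. The function names are hypothetical; the counts (fifty paragraphs, four rejected, thirty-six processed satisfactorily) are the ones given in the evaluation above.

```python
def robustness(total_inputs, rejected):
    """Fraction of inputs the system handled, i.e. did not reject."""
    return (total_inputs - rejected) / total_inputs


def accuracy(handled, correct):
    """Fraction of the handled inputs that were handled correctly."""
    return correct / handled


total, rejected, correct = 50, 4, 36  # counts reported in the text
handled = total - rejected            # 46 paragraphs were processed

print(f"robustness = {robustness(total, rejected):.0%}")
print(f"accuracy   = {accuracy(handled, correct):.0%}")
```

With these counts, 46 of 50 paragraphs were handled (92%) and 36 of those 46 were handled correctly (about 78%); the remaining ten handled paragraphs are the ones that did not follow the simple model.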
We intend to continue work with the present system only for a few weeks, meanwhile considering the problems and promises of extending it.</Paragraph> <Paragraph position="9"> IV. RELATION TO SOME OTHER SIE SYSTEMS The problem that we have had concerning the safety factor is one that is likely to be found in any SIE system, but it is soluble, we feel. Even though we have not completed work on this experimental system as of the time of writing this paper (we have found more syntactic and semantic procedures to be implemented), we already have ideas as to how to build in a better safety factor. Generally, these can be characterized as using some of the information that can be gleaned by a combination of linguistic and chemical knowledge which we had ignored as redundant.</Paragraph> <Paragraph position="10"> While it is redundant in &quot;successful&quot; cases, it produces conflicts in other cases, indicating that something is wrong, and that the document should be processed by hand.</Paragraph> <Paragraph position="11"> If the safety factor can be improved, SIE systems offer a promising area of application of computational linguistics techniques. Clearly, nothing less than computational linguistics techniques show any hope of providing a reasonable safety factor - or even adequate robustness and accuracy.</Paragraph> <Paragraph position="12"> The promise of the SIE area has been recognized by other researchers. Systems that fall within this paradigm include one constructed by the Operating Systems Division of Logicon (Silva, Montgomery and Dwiggins, 1979), which aims to &quot;model the cognitive activities of the human analyst as he reads and understands message text, distilling its contents into information items of interest to him, and building a conceptual model of the information conveyed by the message,&quot; in the area of missile and satellite reports and aircraft activities. Another project, at Rutgers University, involves the analysis of case descriptions concerning glaucoma patients (Ciesielski, 1979), and the most extensive SIE project, also in the medical area, is that of the group headed by Naomi Sager (1981) at New York University, and described in her book.</Paragraph> </Section> </Paper>