File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/69/c69-0301_metho.xml
Size: 52,921 bytes
Last Modified: 2025-10-06 14:11:03
<?xml version="1.0" standalone="yes"?> <Paper uid="C69-0301"> <Title>N EXUS A LINGUISTIC TECHNIQUE FOR P RECOORDINATION</Title> <Section position="3" start_page="0" end_page="13" type="metho"> <SectionTitle> SECTION 2 </SectionTitle> <Paragraph position="0"> Indexing for Information Retrieval An individual is faced with the prospect of maintaining a growing collection of documentation. The documents in this collection contain information that will answer frequently asked questions. When the collection consists of a few documents, this individual can read them all and be prepared to answer these questions. But, as the amount of documeqts increases, he will be forced to find some method of recording clues to the information found in each document. These clues will have to be stored separately from the documents, on a list or perhaps on file cards, so that the maintainer of the documents can scan them easily. When he is asked a question, instead of trying to remember which document or documents have the answer, he goes to his list of clues, and then selects the documents from the collection. The number assigned to each group of clues is the same as the number on the document. null Let us assume that most or even all of the questions asked of this individual are predictable. He is then in the fortunate position of being able to look for specific answers to specific questions as he records the clues from each incoming document. He can then arrange the list of clues in whatever order is most convenient for him. He can arrange the clues by frequency of questions asked, he can classify the clues by hierarchical relationship, by chronology or by any other convenient method that might best or most quickly answer these stock questions.</Paragraph> <Paragraph position="1"> In some very fortunate cases, a collection of documentation consists of documents that have been specifically designed to answer questions. Each document is constructed with a consistent number of information or data blocks and the contents of these blocks vary to a predictable degree. The recording of information clues (we may as well now refer to this function as indexing) then becomes a simple task.</Paragraph> <Paragraph position="2"> Collections of technical papers, the most common type of information collections, do not lend themselves to similar handling. One can predict only to a very small degree, what questions will be asked of such a collection. Therefore, the indexer must select clues from each document based on his speculation of what questions will be asked in the future. It would seem that wearenow getting a vague picture of what an indexer looks like. He is able to pick up any highly technical paper, most of which are at the forefront of their disciplines (otherwise why should they be published?), to understand the content of this document so expertly that he can predict the questions that will be asked and then answered by this document, and then to record the clues to its contents in such a manner that they will lead a searcher directly to this segment of recorded knowledge at some unknown future date. 
This astute person must certainly possess knowledge equivalent to advanced degree level in numerous scientific disciplines; he must have working knowledge of many of the world's languages; surely he must possess an advanced degree in Library Science (more popularly, Information Science), and the knowledge of practical economics to such an extent that he can subsist comfortably on six to seven thousand a year (the going rate for indexers). Armed with such a formidable background this individual would render better service, at least to himself, by doing the research and writing the paper himself.</Paragraph> <Paragraph position="3"> Obviously, the indexing function must be performed by someone less qualified than the individual described above.</Paragraph> <Paragraph position="4"> In a normal library atmosphere, the area usually given responsibility for the important endeavor of maintaining documentation collections, there is a traditional way to process such material. Indexing is performed using such aids as subject-heading lists or thesauri. The documentalist/librarian's use of the term, thesaurus, refers to a dictionary-order list of approved indexing terms, similar to a subject-heading list.</Paragraph> <Paragraph position="5"> The indexer, in the above-mentioned environment, scans a document, tries to figure it out the best he can, and then selects terms from these approved lists that he thinks best describe the document. Sometimes this works, sometimes not. After all, the indexer cannot be expected to be expert in all technical fields. Anyway, the resulting terms that are the clues to the document's content are generalizations of this content. It goes without saying that if a researcher is writing about a new usage of holography in pathological x-ray applications, this document surely has something to do with photographic techniques in medicine. If holography is not an approved term, it will eventually be added to the list when approved. In the meantime, it cannot be used, of course. But the term, x-ray, has been around long enough to be acceptable, and the searcher can hunt around at a higher (more general) level until he locates the document.</Paragraph> <Paragraph position="6"> The point is, such approved term lists are designed to aid the partially knowledgeable library user (or library worker) who does not know the technical vocabularies of special disciplines well enough to use them intelligently. The use of generalized terms stems also from the attempt, on the part of librarians, to store their reading materials in related clumps within a library. This is understandable in a public library or even in a book collection of a technical library. A user wants a book on computer programming, so he goes to the section of books that contains programming books. However, if he wants to know the latest published research on a particular programming technique he will find it in document or journal article form. He will know, in his own terminology, what he wants at a considerably more specific level than &quot;computer programming,&quot; or say, than the approved Library of Congress subject heading, &quot;Electronic digital computers - Programming.&quot; Why not use the terms the researcher uses? Well, they are not controlled, you might say. A term might be in vogue today that is turned into something else tomorrow. You will clutter up your list of document clues (index) with variations of the same term. You may find some words that mean the same thing. 
The truth of the matter is that the actual synonym is not as common as you might think. Slight variations of meaning exist in many words that seem to be synonymous with others. These slight variations may turn out to be highly significant in many contexts. If the words of actual technical jargon are used, some later editing may be in order, it is true, due to the high volatility of language in fast-moving technology, but the documents will be accessible to people who know this language, without translation for the benefit of the middle-man.</Paragraph> <Paragraph position="7"> The knowledgeable searcher knows this language. He uses it every day, and keeps up with its variations. The collection of documents is for his benefit, not for the convenience of the library worker.</Paragraph> <Paragraph position="8"> What impact does all this have on documentation? And specifically on indexing? Let us assume, for just a minute, that we do not have a crew of superintelligent people for indexers. Instead, we have a few competent clerical workers well-enough educated to spell words properly. They can't do foreign languages, so let us, necessarily, eliminate those documents for the present.</Paragraph> <Paragraph position="9"> But they can read titles; they know an author from a date; they can identify an abstract. Within the latter, they are able to tell what words are being used to describe some esoteric subject although they are unable to define the meanings of those words.</Paragraph> <Paragraph position="10"> If these people know enough to get them this far, they can eliminate the function words (is, an, the, but, etc.) from the content words (holography, pathology, thorax, etc.), copy down these latter content words and, in effect, perform indexing. This is indexing at the specific, not the general level.</Paragraph> <Paragraph position="11"> To generalize these terms one would have to know that holography is related to photography, pathology is related to medicine, thorax is related to anatomy, and so on. We don't expect that much sophistication from our clerical workers. We really can't afford to pay for that much knowledge. Actually, we don't want them to know that much. It could bias their indexing.</Paragraph> <Paragraph position="12"> This is exactly the way the KARDIAK 1 automated bibliography on artificial heart research was produced. Now that it has been released (almost three years ago) and has received some acclaim throughout the world of medical research (e.g., Harvard Medical School, National Library of Medicine, National Institutes of Health, etc.), we in Technical Data Systems (better known as IS&R), the compilers of this useful work, are still unable to use it! Why? Because we are not, nor should we be, conversant with the terminology used to index it. We just don't know that much about the technical specialty of cardiac medicine. When we are asked to demonstrate how KARDIAK works, we must use a standard search of two terms: &quot;Ebstein&quot; and &quot;Anomaly.&quot; During the production of this, Dr. Shafer, Artificial Heart Study Program leader, introduced us to Dr. Grey, an eminent cardiac specialist from India. At this point, KARDIAK was 50% compiled. We had about a thousand entries and had produced an interim version. Dr. Grey was asked by Dr. Shafer to pose a question to this half-KARDIAK. 
He thought for a moment and then asked us if we had anything in the bibliography on &quot;Ebstein's anomaly.&quot; For all we knew of this phenomenon, it could as well have been &quot;Einstein's anachrony.&quot; KARDIAK was queried with these terms, however, and produced a sufficient quantity of answers, to our relief and to the pleasant surprise of Dr. Grey. (He later asked for copies.) Anyway, we now use this same query as a test query of the system, because we don't have the sophistication to ask anything else.</Paragraph> <Paragraph position="13"> We should add, however, for justice's sake, that if the KARDIAK were on &quot;Information Science&quot; instead of &quot;Cardiac Medicine&quot;, the situation would surely be reversed.</Paragraph> <Paragraph position="14"> The thesis, so far, has hopefully convinced the reader that it is possible to index highly technical collections cheaply and accurately without superintelligent, universal men wielding the indexer's pencil. But we are still faced with the problem of some cross-discipline communication. We cannot query a collection on &quot;Cardiac Medicine&quot;, and they cannot query a collection on &quot;Information Science.&quot; Now then, how do we go about communicating to one another through the medium of a general-information collection? That is, how do we do this without getting too general and paying the price for this generality? KARDIAK, once again, has given us a clue to how this may be done.</Paragraph> <Paragraph position="15"> As we were feeding KARDIAK the terms selected by our clerk/indexer, some of these terms kept recurring; recurring with such frequency that our computer program could not hold them all in storage. That is, there was not enough room set aside to hold all the document numbers with which these terms were associated. The number of these terms was small, only seven in all, but the number of documents that used these seven terms was extensive. Because of the physical impossibility of storing all these document numbers, these terms were rejected for storage. Oddly enough, perhaps serendipitously enough, if you will, these were the terms that generally described the collection. We have here, then, the general terms to describe the KARDIAK collection, and we have them delivered automatically. If we were to decide that we must have subject headings to communicate in a general fashion to other less knowledgeable searchers, in this case to ourselves, these are doubtless the best candidates. Just for practice, let's make subject headings out of this list: Extracorporeal Blood Circulation.</Paragraph> <Paragraph position="16"> We don't need an approved list of terms. We couldn't have found one, nor known how to use one, if we had had one. It has been said, &quot;Let the documents themselves generate their own terms.&quot;2 One step further, let the terms rejected because of over-frequency be combined as subject headings. These combinations can then be used as general descriptors for the particular collection.</Paragraph> <Paragraph position="17"> The KARDIAK is a closed collection. That is, it was produced for a specific purpose, it served its purpose, and it is now a static piece of documentation history. Of course, it can always be picked up at a later date and be added to; but we don't foresee this happening at the present time. This is all leading up to the fact that there is any amount of manipulation one can perform on a static collection that cannot be done on a growing one. 
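The term-rejection mechanism just described is easy to picture in code. The following is a minimal Python sketch under assumed names and an assumed capacity limit (the original KARDIAK run was not implemented this way): terms whose lists of document numbers overflow a fixed per-term storage allotment are dropped from the index and reported as candidate collection-level descriptors.

from collections import defaultdict

CAPACITY = 200  # hypothetical stand-in for the fixed storage allotment per term

def build_index(doc_terms):
    """doc_terms: iterable of (document_number, [index terms]) pairs."""
    postings = defaultdict(list)
    for doc_no, terms in doc_terms:
        for term in set(terms):
            postings[term].append(doc_no)
    # Terms whose document lists overflow the allotment are rejected for storage;
    # in the KARDIAK case these few over-frequent terms described the collection.
    rejected = sorted(t for t, docs in postings.items() if len(docs) > CAPACITY)
    stored = {t: docs for t, docs in postings.items() if len(docs) <= CAPACITY}
    return stored, rejected  # the rejected terms are the candidate subject headings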
When a collection is constantly being added to, one must figure out a way to maintain control of it as it develops. If the collection is specialized enough, the term rejection factor, mentioned above, will still appear. But, as the collection grows, we certainly must increase our storage capacity for the ratio of document numbers to terms. This ratio probably remains the same, but we can't say so for sure unless we do some research on it. This is an area for further work with which we are not principally concerned in this report.</Paragraph> <Paragraph position="18"> What we would now like to suggest is an interim feature: an aid to indexing and searching that is in between a free, specific, individual key word system and a generalized, controlled subject-heading system. We have already shown an almost algorithmic way of doing indexing. The clerical worker identifies a title and an abstract, and separates content words from function words. The content words are then copied down, or in the case of KARDIAK, are keypunched directly on punched paper tape. It's easy to imagine a machine doing essentially the same operation, and this is what we have done.</Paragraph> <Paragraph position="19"> A program was written similar to one described in previous research 3 which, using a function word deletion list, scans lines of text and records the content words that are in the original syntactical order of the text. Such a method resembles the well-known KWIC indexing system. These remaining content words can be used as index terms for searching the collection on a specific level. They can be stored on tape with each term added to previously stored usages of the term by recording the document number under that term. Or, in the case of a first-time usage of a term, a new entry on tape is made. Now, so far, this is essentially what was done by the clerical worker. But now we have avoided her occasional human errors, and since her human judgment was previously discouraged we have lost very little, and have gained a great deal in speed and accuracy.</Paragraph> <Paragraph position="20"> At this point, let's switch over to the searching function. The searcher knows the terms he is looking for, if he knows the technical specialty concerned. His query will be couched in these same terms. Therefore, he proceeds in his search of the collection by combining terms and looking for coordinating document numbers. (This follows no matter if he is doing it manually, such as with KARDIAK, or whether a computer search is made.) One element is missing, however, and that is syntax. He must presume that the hits he comes up with are of terms arranged in the same syntactical order as his search query. In other words, he is attempting to regenerate sentence order. This is successful much of the time, but then again there are times that it doesn't work.</Paragraph> <Paragraph position="21"> If we had our clerical worker again, we could show her some lines of text and ask her to combine words that bear relationship to one another. If she did a good job of making combinations, some of this missing syntax would be recovered. Let's fake a title, for example: &quot;Applications of Linguistic Experiments to the Industrial Community.&quot; Our clerk would probably make the following combinations out of the title. Now, for the clerk to do term combining correctly, she uses some simple rules. The most obvious rule is that of sequence. 
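Before turning to the less obvious combining rules, the machine version of this indexing step amounts to little more than the following Python sketch (the stopword list is an assumed sample; the original program was written in FORTRAN II and punched its output on paper tape).

FUNCTION_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "on",
                  "for", "and", "but", "by", "with", "from", "during"}

def auto_index(line_of_text):
    """Return the content words of a text line in their original syntactic order."""
    words = [w.strip(".,;:'\"()") for w in line_of_text.split()]
    return [w for w in words if w and w.lower() not in FUNCTION_WORDS]

# auto_index("Applications of Linguistic Experiments to the Industrial Community")
# -> ['Applications', 'Linguistic', 'Experiments', 'Industrial', 'Community']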
There are other rules used that are not so obvious, even to her, because she may not know she is using them. These rules have to do with linguistics, specifically suffixal morphology. This is to say that the suffixal morphemes of the words in this title are giving her clues about the relationship of one word to another. In other words, the presence of one of a group of particles at the end of a content word in a line of text will give a clue to its relationship to the next content word. Of course, the next word in sequence must be examined for the presence of a final particle, as well. Let's take &quot;linguistic experiments&quot; as an example. The two words are in sequence in the text line, even though this is not an absolute indication that they should be combined. The suffixal morpheme of &quot;linguistic&quot; is &quot;-ic,&quot; an adjectival ending. And since there is no punctuation following &quot;-ic,&quot; this indicates the proximity of some next entity to be modified, some noun form coming up. In our example it is &quot;experiments.&quot; But, if the suffixal morpheme of &quot;experiments&quot; were &quot;-al&quot; instead of &quot;-s,&quot; and there is still no following punctuation, we would have a clue that we don't yet have a noun form to be modified. We have two adjectives stacking up, and the next following word may be the noun form we have been waiting for. However, the &quot;-s&quot; morpheme is most likely acceptable enough as a noun plural ending, and the combination &quot;linguistic experiments&quot; is a valid one.</Paragraph> <Paragraph position="22"> The application of such rules by our clerical worker is automatic because she does all these operations following the rules that are built into her knowledge of the language. She might possibly be able to explain the process but it is so obvious and natural to her that she might not be able to.</Paragraph> <Paragraph position="23"> To do this function by machine is another matter. We must not only explain the process, but we must also instruct the computer precisely what to do and in what order to do it. And also, unfortunately, we must put up with occurrences of letter constructions that look like a legitimate suffixal morpheme, such as the plural &quot;-s&quot;, but are actually not; constructions which would be immediately obvious to our clerical worker.</Paragraph> <Paragraph position="24"> Succeeding sections of this report will outline the method used (NEXUS) to precoordinate terms during the automatic indexing process.</Paragraph> <Paragraph position="25"> All programming of this research task was accomplished by James C. Moore and G. E. Sullivan, of Department 591-0, in FORTRAN II. The computer used was the CDC 160G.</Paragraph> </Section> <Section position="4" start_page="13" end_page="18" type="metho"> <SectionTitle> SECTION 3 NEXUS I </SectionTitle> <Paragraph position="0"> The inspiration for NEXUS came from a particular collection compiled by IS&R on legal literature.</Paragraph> <Paragraph position="1"> The indexing was done by an individual highly trained in law who had never done any previous indexing. His indexing consistency, to begin with, was slightly erratic in that he occasionally repeated terms in bound form that he had already noted down in free form. However, as he progressed through the collection of 1742 documents his indexing became more stabilized.</Paragraph> <Paragraph position="2"> Each document was given an accession number. 
The index terms, usually six or seven of them, were listed under the number. The indexer wanted retrieval by date at some future time, so he used the year the document was published as an index term in every case.</Paragraph> <Paragraph position="3"> The output of this project was a KARDIAK-type (or &quot;busted book&quot;, as it is known in IS&R) manual index, which was produced by computer. The terms were sorted alphabetically and the document numbers of the documents indexed by the term listed beneath each term in ascending order.</Paragraph> <Paragraph position="4"> Precoordination of these terms would have aided the searcher, in the way previously indicated, as a time-saver and a syntax safeguard. This would have prevented the searcher from erroneously hooking together terms that actually were not related.</Paragraph> <Paragraph position="5"> To begin with, the unsorted sets of index terms were used as input to NEXUS. NEXUS was first put together in a very rudimentary form. The dates were isolated and the criteria for precoordination were based on (1) sequence, (2) &quot;-ed&quot; suffixal morpheme in the first position, and (3) &quot;-s&quot; suffixal morpheme in the second position. The flow chart for NEXUS I, with the aforementioned legal collection in mind, is shown in Figure 3-1. The first step (1) is to examine the first term in the document term set under initial examination. If the first term is a date (2), we don't want to couple it with another term, so we leave it as a single term and move on to the next word (3), if there is one. The next word is examined as a first word (4), and if it is not a date, it is tested (5) for a final plural morpheme, &quot;-s&quot;. If it does end with &quot;-s&quot;, a preceding word is looked for (6). If no preceding held word exists, the term is printed as a single term (7). If the term does not have an &quot;-s&quot; ending, it is held for pairing (8) with the next word in the set (9). If this held word is the last in the set, it is also (7) printed as a single term. But, if there is a next word (16), the next word is examined and (11) tested for being a date.</Paragraph> <Paragraph position="7"> If it is a date, it is printed (12) as a single term.</Paragraph> <Paragraph position="8"> If not, it receives a test for &quot;-ed&quot; (13) as the final morpheme. This morpheme can only be allowed with the first word of a pair (unless, of course, it is the last term in the set, in which case it is printed alone). If &quot;-ed&quot; is present, the held first word (14) is printed by itself, and the &quot;-ed&quot; word is held for first-position pairing. If &quot;-ed&quot; is not present, the held word is printed with this word (15) as a coupled pair.</Paragraph> <Paragraph position="9"> Let's go back to (5) where a word is tested for the presence of an &quot;-s&quot; final morpheme. The word does end with &quot;-s&quot;, so we check for a preceding word (6). In this case we will get &quot;yes&quot; for an answer, and the next test is (16), &quot;Does the preceding word end with '-s'?&quot; If &quot;no&quot; to this test (17), the word with &quot;-s&quot; ending is printed in the second position of a pair, with the preceding word in first position. If the answer to (16) is &quot;yes&quot;, the function (18) is activated, which checks the word preceding and (19) checks that word for a suffixal &quot;-s&quot;. The program loops between (18) and (19) until a non-&quot;-s&quot; suffixal morpheme word is found. 
It then (20) prints the latter word in first position, followed by all &quot;-s&quot;-ending words. This portion of the NEXUS I program can produce precoordinations of more than two words. The remainder of the tests and functions on this flow chart are probably self-explanatory. If the program runs through the set of terms for one document, indicated by a &quot;no&quot; at test (3), the next test (21) asks &quot;Is there a next record?&quot;. If &quot;yes&quot;, the next set of terms for a document is brought up by function (22) and the processing continues.</Paragraph> <Paragraph position="10"> If all document term sets have been processed, the answer to (21) is &quot;no&quot;, and the program terminates.</Paragraph> <Paragraph position="11"> The NEXUS I program processed all 1742 document sets contained in the legal information system. The results of this processing produced 4078 combinations. 3527 of these were good precoordinations. In 154 cases, terms with &quot;-s&quot; suffixal morphemes were isolated, thereby avoiding ambiguous combinations. 397 precoordinations were unsuccessful. The latter quantity, however, was the source of further rules that will be applied to future versions of NEXUS. We knew that the development of this program would have to involve expansion of the rules step by step. So some of the bad coordinations showed us where more rules could have been applied to avoid them. Of course, some of these anomalies were unavoidable. They were merely caused by characteristics of the language with which we have to live if we are going to continue to speak English.</Paragraph> <Paragraph position="12"> Because of our &quot;-s&quot; rule in second position only, the program isolated &quot;Jurimetrics&quot; instead of making the obvious (to a human) coordination, &quot;Jurimetrics Committee.&quot; The rule must be valid for only one position, and the second position is the most common one. Continuing the sequence, &quot;Committee&quot; was precoordinated with &quot;Scientific&quot; because of the sequence rule. This is also an obvious error to a human, because of the suffixal morpheme &quot;-ic&quot;, which is part of &quot;Scientific.&quot; In analyzing the production, so far, &quot;-ic&quot; seems like a good candidate for a first-position suffixal morpheme; so, it became one in the next version of the program. The next combination, &quot;Scientific Investigation,&quot; turned out successfully because of sequence, but &quot;Investigation Legal&quot; went bad, once again because of a suffixal morpheme cue that wasn't included in the program.</Paragraph> <Paragraph position="13"> This morpheme was the &quot;-al&quot; on the term &quot;Legal&quot;, which was later included as a first-position rule. Finally, &quot;Legal Problems&quot; was produced, meeting the requirements of both sequence and &quot;-s&quot; rules.</Paragraph> <Paragraph position="14"> Please bear in mind that the rules incorporated in this program can never attain 100% effectiveness. Natural language won't allow it. Still, NEXUS I delivered 90% correct precoordinations, which is encouraging as the first try of an experimental program.</Paragraph> </Section> <Section position="5" start_page="18" end_page="20" type="metho"> <SectionTitle> SECTION 4 NEXUS II </SectionTitle> <Paragraph position="0"> Based on the success (and the failures) of NEXUS I, an expanded version of the program was written. 
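Before the NEXUS II changes are described, the NEXUS I rules narrated above can be condensed into a short Python sketch. This is a reconstruction rather than the original FORTRAN II program: the overlap and &quot;-s&quot;-stacking features are omitted, and the date test is an assumed pattern. Read this way, the reported figures give 3527 good pairings out of 3527 + 397 = 3924 attempted, or roughly 90%.

import re

def is_date(term):
    # The legal collection used the year of publication as an index term.
    return bool(re.fullmatch(r"(18|19)\d\d", term))

def nexus1(terms):
    out, held = [], None
    for term in terms:
        if is_date(term):
            if held:
                out.append(held)
                held = None
            out.append(term)                  # dates are never coupled
        elif term.lower().endswith("s"):
            if held:
                out.append(f"{held} {term}")  # "-s" accepted in second position
                held = None
            else:
                out.append(term)              # no held first word: print alone
        elif term.lower().endswith("ed"):
            if held:
                out.append(held)              # "-ed" may not take second position
            held = term                       # hold the "-ed" word for first position
        else:
            if held:
                out.append(f"{held} {term}")  # plain sequence pairing
                held = None
            else:
                held = term
    if held:
        out.append(held)
    return out

# nexus1(["Jurimetrics", "Committee", "Scientific", "Investigation", "Legal", "Problems"])
# -> ['Jurimetrics', 'Committee Scientific', 'Investigation Legal', 'Problems']
# (the original program's overlap feature, omitted in this sketch, additionally
#  produced "Scientific Investigation" and "Legal Problems", as described above)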
NEXUS II was made more effective by adding rules principally affecting first-position qualification, and one rule affecting both first and second position.</Paragraph> <Paragraph position="1"> The new first-position rules included the suffixal morphemes &quot;-al&quot;, &quot;-ern&quot;, &quot;-ese&quot;, &quot;-ic&quot;, &quot;-ive&quot;, &quot;-ly&quot; and &quot;-ous&quot;. The remaining rule was one that prevented two words with &quot;-ing&quot; endings from being paired together.</Paragraph> <Paragraph position="2"> As you may have noticed, the first-position rule, &quot;-ous&quot;, conflicts with the second-position rule, &quot;-s&quot;. The latter rule looks for a final &quot;-s&quot; only and when it finds one, qualifies the term for second position. Because of this, the &quot;-s&quot; test must also include a test for preceding &quot;o&quot; and &quot;u&quot;. When these are present, we have a first-position rule in effect; when absent, a second-position rule.</Paragraph> <Paragraph position="3"> One of the NEXUS I rules was eliminated: the rule for stacking &quot;-s&quot; words and attaching the first non-&quot;-s&quot; word in first position. This rule did not produce anything of value, and could possibly have contributed to ambiguity.</Paragraph> <Paragraph position="4"> However, a turnabout version of this rule was adopted. This rule, if it locates a sequence of first-position suffixal morphemes, will stack them up until it finds a second-position word. It then prints them all in combination. In this way, we have a method for creating strings of terms in precoordination consisting of more than two words. &quot;Three-dimensional Holographic Techniques&quot; is an example of a production of this kind.</Paragraph> <Paragraph position="5"> NEXUS I contained an overlapping feature which we haven't mentioned, but which may have been obvious when we went through the &quot;Jurimetrics, Committee, Scientific, and so on&quot; example. The purpose of overlapping was to left-justify each term whether combined or left alone, so that it could be stored alphabetically in an IS&R system. In this way, no term is hidden from the search by reason of being forever concealed in second position in storage.</Paragraph> <Paragraph position="6"> We did install a jump switch in NEXUS II, so that we can eliminate overlapping, if desired. If mere subject-headings are required, overlapping is of no value; but if the option for a free-term search is needed, the overlapping feature allows storage and search exposure of each individual term.</Paragraph> <Paragraph position="7"> Two very different corpora were run against NEXUS II. The first was a collection of computer program descriptions which was assembled for the Scientific Master Programming System (published as &quot;Information Storage and Retrieval Computer Program Index,&quot; GDC-DBA68-003). The second was a series of documents from the NASA Tape System collection.</Paragraph> <Paragraph position="8"> The program descriptions consisted of abstracts of what each particular program was intended to do, and how it operated. Each description also had a short name, a title, the computer language used, the name of the responsible programmer and the responsible engineer, and a set of terms used to index the description.</Paragraph> <Paragraph position="9"> The abstract portion of each description was used to supply NEXUS II with material to work with. The abstracts were first processed through an auto-indexer to produce lists of terms. 
These lists were next presented to NEXUS II and then printed out for analysis after the term-binding operations were performed. NEXUS II was run two ways: with and without the overlapping feature.</Paragraph> <Paragraph position="10"> The program worked well with this material, with one exception. The suffixal morpheme carried by the third person singular, present tense verb, &quot;-s&quot;, has the same physical appearance as the plural morpheme, &quot;-s&quot;. Since the computer can't tell the difference, there occurred some bound terms that were somewhat less than rife with meaning; for example, &quot;Program Calculates&quot;, &quot;Computes&quot;, &quot;Program Generates&quot;, &quot;Program Uses&quot;. Although these odd combinations could be avoided by employing a different writing style when producing the abstracts, we are not concerned with preconditioning a corpus, but rather with handling it in whatever form we happen to find it. The above combinations can certainly be tolerated, however, since they have no effect on the other precoordinations. Still, their value in a future search may be predicted as slight. A NEXUS II processed record of the computer program descriptions shows:</Paragraph> </Section> <Section position="6" start_page="20" end_page="21" type="metho"> <SectionTitle> 9916 Geometry </SectionTitle> <Paragraph position="0"> The second corpus processed through NEXUS II consisted of titles of documents from the NASA Tape System. These titles were first auto-indexed in the same way as the abstracts of the computer program descriptions. The lists of terms derived in this way were then given the NEXUS II treatment.</Paragraph> </Section> <Section position="7" start_page="21" end_page="28" type="metho"> <SectionTitle> SECTION 5 </SectionTitle> <Paragraph position="0"> As an exercise in demonstrating the difficulties encountered in handling natural language for computerized information retrieval, the NEXUS experiments have been very successful.</Paragraph> <Paragraph position="1"> The intent has been to expand upon more or less standard automatic indexing techniques by reestablishing a connection between terms that, when combined, aid the searcher in retrieving a document reference from storage. We have named this process precoordination because of its relationship to coordinate index systems. In a coordinate index the searcher combines terms, looking for a common accession number, thereby indicating their occurrence together in a document description. NEXUS has an application in precoordinating these terms, when applicable, to save time for the searcher, to ensure a correct coordination, and to prevent coordinating terms that give a misleading implication. Precoordinated terms are then, in effect, equivalent to subject headings insofar as they partially express a concept in one or more words in a syntactic construction.</Paragraph> <Paragraph position="2"> The comparison of NEXUS, and its several linguistically-based rules, with SEQS, and its single rule for sequential linking, has shown that NEXUS is the more efficient of the two approaches. Neither, of course, can compare with human decision power, which has the ability to employ knowledge, past experience, and heuristics. Since we are trying to approach a human intellectual activity using a machine, however, the work of a human will probably always make our results look inferior. We are limited to looking at words primarily as physical entities and then relating these physical features to semantic relationships. 
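For reference, the SEQS baseline is described later in this report as connecting every two terms as they occur in syntactical order; read that way, it reduces to the following Python sketch (an interpretation of the sample pages, not the original code).

def seqs(terms):
    """Pair the auto-indexed terms two at a time in their original order."""
    return [" ".join(terms[i:i + 2]) for i in range(0, len(terms), 2)]

# seqs(["Measurements", "Magnetic", "Properties", "Preheating",
#       "Phase", "Spindle", "Cusp", "Experiment"])
# -> ['Measurements Magnetic', 'Properties Preheating', 'Phase Spindle',
#     'Cusp Experiment']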
There is only so much to work with in English, and that much is not 100% reliable, as we have seen.</Paragraph> <Paragraph position="3"> We have attempted to use a simple algorithm, and to add to it, or subtract from it, through trial and error. No doubt these rules can be expanded more than they have been, so the program is open to further additions at any time.</Paragraph> <Paragraph position="4"> The NEXUS II flow chart, Figure 5-1, with a narrative explanation, follows.</Paragraph> <Paragraph position="5"> The first step at (1) is to read a record, a document term set. Step (2) examines the first term in the set and if there is one, moves through the date test (3), which is a holdover from the legal data collection. Next, the program makes the first suffixal morpheme test (4). If the examined word does not end in &quot;-s&quot;, it is held for pairing (5) and the &quot;-ed&quot; counter is set to zero. This counter is used for all first-position suffixal morpheme words, not just for those that end in &quot;-ed&quot;. The counter is used to keep track of the number of first-position words that accumulate before a second-position word appears, so that they can all be printed out in a string; e.g., &quot;BINARY DIGITAL CALCULATING MACHINE&quot;.</Paragraph> <Paragraph position="6"> The program then moves to (6) where a next word is looked for. If &quot;no&quot;, the word held at (5) is printed as a single term (7) and a return to (2) is made, in turn going to (1) and the next record is begun. If (6) is &quot;yes&quot;, the NEXUS I date check is made (8) which results in &quot;yes&quot; back through (7) and then (2) again, or &quot;no&quot;, which is governed by Sense Switch 2 (9). Sense Switch 2 can be set to pass an examined word through the tests for &quot;-ing&quot; in first position (10) and in second position (11) in order to prevent coupling of words bearing these suffixes. These tests currently have no value because &quot;-ing&quot; has been established as a fairly reliable first-position suffixal morpheme and therefore must be allowed to stack up with words bearing &quot;-ing&quot; or any of the other * (* refers to NOTE - center of page, Figure 5-1) words. The test has been left in in case it ever appears to be of any future use.</Paragraph> <Paragraph position="7"> Assuming Sense Switch 2 to be in an &quot;on&quot; position, a &quot;no&quot; answer to (8) proceeds directly to (12) where the held first word receives the first-position test for &quot;-ed&quot;. If &quot;yes&quot;, the &quot;-ed&quot; counter is incremented and the second word is passed through an &quot;-ed&quot; test (14). A &quot;no&quot; at (12) passes the program directly to (14). If (14) is &quot;no&quot;, the second word is tested for presence of any of the other suffixal morphemes qualifying a word for first position (noted as *) (15). If (14) is &quot;yes&quot;, the first word is tested for an * ending (16). A &quot;no&quot; at (15) moves the program to (17) where the first and second words are printed, the counter is set to zero and a flag, 2 (for later identification as a coupled pair), is placed at the end of the first and second words. This flag is externally suppressed.</Paragraph> <Paragraph position="8"> Passing through an indexer (pointing to the last word of a combination) and moving further to (18), there is Sense Switch 1, which controls overlapping. 
This is the feature that assures all terms a left-justified accessibility, by printing terms individually as well as in combinations. With the sense switch off, the program moves to (7) and the last word in the combination is printed alone. With the sense switch on, the program returns to (2) and continues through the record.</Paragraph> <Paragraph position="9"> Backing up now, to (15). If a &quot;yes&quot; answer is made at (15), the first word is tested for * ending at (16). If &quot;no&quot; at (16), the first word receives an &quot;-ed&quot; test (19) and upon receiving another &quot;no&quot; at (19) the first word is printed alone at (7). If &quot;yes&quot; at either (16) or (19), the &quot;-ed&quot; counter (20) (which also counts * words) is incremented and a test for a next word is encountered at (21). If there is not a next word in the record under examination, each &quot;-ed&quot; (or *) word is printed individually (22) and the counter reset to zero. The program then goes back to (2). If there is a next word in the record, the date test is made (23). If &quot;yes&quot; on (23), the print instruction (22) is applied to all &quot;-ed&quot;/* words, and then back to (2). If &quot;no&quot; on (23), the next word is checked for &quot;-ed&quot; (24) and * (25). Failing both of these tests, all &quot;-ed&quot; and * words are printed in a string (with the last member of the string a non-&quot;-ed&quot;/*) (26). If either of these tests (24), (25) is positive, the program loops back through (2), increments the &quot;-ed&quot; counter and cycles through (21), etc., again.</Paragraph> <Paragraph position="10"> Let's now go back to the first suffixal morpheme test, the last word &quot;-s&quot; test at (4), and assume a &quot;yes&quot; answer. We then must find out if it is a plural &quot;-s&quot;, or part of an * ending, &quot;-ous&quot; (27). If it is &quot;-ous&quot;, we then go to (5), and thence through the route just explained above. If it is not &quot;-ous&quot;, but a plural &quot;-s&quot;, we move to (28) to check for a preceding word. If there is no preceding word, the &quot;-s&quot; word is printed as a single word (29), and back to (2). If there is a preceding word, the date check (30) goes into effect. If positive, the program moves to (29) and the &quot;-s&quot; word is printed as a single term. If &quot;no&quot; on (30) the test is made &quot;Does preceding word end with '-s'?&quot; (31) which, when &quot;no&quot;, moves the program to (32) &quot;Is the preceding word part of a coupled pair?&quot;. This is the reason for the flag put at the end of the 1st and 2nd words at (17).</Paragraph> <Paragraph position="11"> If &quot;yes&quot; at (32) the program shifts to (29) where the &quot;-s&quot; word is printed as a single word. If &quot;no&quot; at (32), the program prints the preceding word with this &quot;-s&quot; word (33). If &quot;yes&quot; at (31), there is a test for a preceding word (34). If &quot;yes&quot; at (34), the date test (35) takes place. If &quot;no&quot; at (34), the program shifts to (29) and prints the &quot;-s&quot; word as a single word, and then goes back to (2). This also occurs when there is a &quot;yes&quot; answer at (35). If &quot;no&quot; at (35), the program goes back to (29) where the &quot;-s&quot; word is printed as a single word. This is the latest version of NEXUS II. The flow chart has superfluities that haven't been removed. Many instructions could be combined to save operations. 
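A condensed Python sketch of the term-binding pass just narrated is given below. It keeps only the rules stated explicitly: the first-position suffixes &quot;-al&quot;, &quot;-ern&quot;, &quot;-ese&quot;, &quot;-ic&quot;, &quot;-ive&quot;, &quot;-ly&quot; and &quot;-ous&quot; (with &quot;-ed&quot; and &quot;-ing&quot; treated the same way), the plural &quot;-s&quot; qualifying a word for second position unless the ending is really &quot;-ous&quot;, the stacking of consecutive first-position words, and the coupled-pair flag. The names, the date pattern, the simplified control flow and the overlap handling are assumptions, not the original FORTRAN II logic.

import re

STAR_SUFFIXES = ("al", "ern", "ese", "ic", "ive", "ly", "ous", "ed", "ing")

def is_date(term):
    return bool(re.fullmatch(r"(18|19)\d\d", term))

def first_position(word):
    return word.lower().endswith(STAR_SUFFIXES)

def second_position(word):
    w = word.lower()
    return w.endswith("s") and not w.endswith("ous")   # the "o"/"u" check

def nexus2(terms, overlap=False):
    out, stack = [], []        # the stack plays the role of the "-ed"/* counter
    paired = set()             # indices flagged as second members of coupled pairs
    for i, term in enumerate(terms):
        if is_date(term):
            out.extend(stack)
            stack = []
            out.append(term)                            # dates stay single
        elif second_position(term):
            if stack:
                out.append(" ".join(stack + [term]))    # string of modifiers + head
                paired.add(i)
                stack = []
            elif i > 0 and not second_position(terms[i - 1]) \
                    and not is_date(terms[i - 1]) and (i - 1) not in paired:
                out.append(f"{terms[i - 1]} {term}")    # preceding word + "-s" word
                paired.add(i)
            else:
                out.append(term)                        # print the "-s" word alone
        elif first_position(term):
            stack.append(term)                          # accumulate modifiers
        else:
            if stack:
                out.append(" ".join(stack + [term]))
                paired.add(i)
                stack = []
            else:
                stack.append(term)                      # plain word held for pairing
    out.extend(stack)          # leftovers are printed individually
    if overlap:                # Sense Switch 1: expose every term left-justified
        out.extend(t for t in terms if t not in out)
    return out

# nexus2(["Three-dimensional", "Holographic", "Techniques"])
# -> ['Three-dimensional Holographic Techniques']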
But, the intent has been to get this program operating and reported on. The obvious flaws concern the combining of the various rules that apply to &quot;-ed&quot; endings as well as to * endings; these rules are to be treated the same. No doubt, other things could be combined to make a more efficient program.</Paragraph> <Paragraph position="12"> A few suggestions for applying this method should be made. The previous method for auto-indexed terms has been to use them in a &quot;busted-book&quot; or computer-generated coordinate index. The NEXUS-generated subject headings are definitely not suitable for this type of output. The best type of output format would be something approaching what was done for the Aeromedical Evacuation Study Bibliography. 4 That was a subject-heading listing, followed by a full bibliographical entry: author, title, date, series number, and corporate author. Sample entries are shown below:</Paragraph> </Section> <Section position="8" start_page="28" end_page="30" type="metho"> <SectionTitle> ALTERATIONS MEASUREMENTS DURING AEROMEDICAL EVACUATION </SectionTitle> <Paragraph position="0"> This bibliography was machine processed after all input was subjected to human analysis.</Paragraph> <Paragraph position="1"> The final bibliography consisted of four sections: by subject, by author, by title, and by source. The latter section was an alphabetical sort of the journals, books, papers, and manuals from which the material was taken. A modification of this form of output has been suggested by Mr. James Moore of 591-0, who was responsible for NEXUS I and II programming. His suggestion involves a sort and printout of each auto-indexed term, and beneath each such term is printed the NEXUS precoordinated set of which the term is a member. Beneath this term set would be the bibliographic entry, in entirety. Where terms are not members of a precoordinated set, they are printed alone, followed by the full bibliographic entry. A hand-generated example of this format would appear like this: (in &quot;S&quot; portion of alphabet) (in &quot;D&quot; portion of alphabet)</Paragraph> </Section> <Section position="9" start_page="30" end_page="30" type="metho"> <SectionTitle> DESCRIPTION BOEING CO. COMMERCIAL SUPERSONIC TRANSPORT PROPOSAL, A-111, AIRCRAFT DESCRIPTION. D6-2400-9 THE BOEING CO. 15 JAN 64 </SectionTitle> <Paragraph position="0"> Six subject entries per document reference may seem excessive at first glance, and there may be more if an abstract is also processed, but roughly this same approach was used for the Aeromedical Evacuation Bibliography and was found to be helpful. Unfortunately, the same human analysis that was employed in processing the input to that program was not completely thorough in picking up all possible subject headings for sorting. A machine-analysis system would not suffer from this fallibility.</Paragraph> </Section> <Section position="10" start_page="30" end_page="32" type="metho"> <SectionTitle> SECTION 6 </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="30" end_page="32" type="sub_section"> <SectionTitle> Recommendations </SectionTitle> <Paragraph position="0"> Linguistics is becoming more and more recognized as a basic research area in information retrieval. 
The problem of document analysis and index-term selection is the most fundamental activity of all in the cycle of document-to-storage-to-document user, which is what information retrieval really amounts to.</Paragraph> <Paragraph position="1"> No matter how sophisticated the storage medium might be, no matter how fast the computer can sift through a data bank searching for information, an information retrieval system is only as good as its contents.</Paragraph> <Paragraph position="2"> Linguistics, as applied to information retrieval, is concerned with improving the input function in the design of automatic indexing, abstracting and classification methods. The kind of linguistics used in these applications is limited to the written word or the analysis and manipulation of graphemes. Linguistics, in a general sense, concerns itself with speech sounds, from which a graphemic representation of a language is one step removed. If the day ever comes that a computer can more efficiently accept the spoken word than the written word, linguistics, in a fuller sense, will be found applicable. There will probably be interim improvements in methods for computer input that will predate voice input, however. Such input devices as optical scanners and page readers may make a long-awaited appearance, for practical purposes, before people can talk to a computer in any application other than an experimental one. If there is any doubt of the superiority of the spoken word over the written as an information carrier, one merely has to read a television jingle or such a phrase as &quot;very interest-ting,&quot; heard on a popular TV program, to realize that the suprasegmental phonemes of stress, pitch, juncture and even accent in the dialect sense, completely lost in the written word, are very much present and necessary in the spoken word.</Paragraph> <Paragraph position="3"> Getting back to the kind of linguistics with which we have been directly concerned, we have been devising rules for joining together two or more words to make up a phrase. The rules are activated when one or more characters (graphemes) are found at the ends of words (suffixal morphemes) that have an effect on the word's connectability to other words in a sequence (syntax).</Paragraph> <Paragraph position="4"> These rules work every time. There is no decision maker involved allowing an occasional exemption to a rule. Since the rules are of a general-purpose kind, they are set up to operate on the most frequent conditions. The exceptions to these conditions that occur occasionally are merely tolerated. No attempt has been made to set up ad hoc rules to cover them. It so happens, unfortunately, that the name &quot;Information Retrieval&quot; is one of these exceptions and would not be produced as a combination by the NEXUS program.</Paragraph> <Paragraph position="5"> Although the NEXUS method is far from perfect, even in its present state it is reasonably workable as a subject-heading generator. Its consistency of operation, of course, exceeds human processing; an advantage in some respects and a disadvantage in others, as already pointed out.</Paragraph> <Paragraph position="6"> Research of this type is not intended to produce a panacea that will solve all natural-language-input problems, but is intended to shed a little more light on language manipulation by computer and perhaps take a few tentative steps towards a solution of these problems. 
Hopefully, this research has been successful to that extent.</Paragraph> <Paragraph position="7"> The following pages are S-C 4020 microfilm hard copies showing a comparison between NEXUS processing and SEQS processing of NASA Linear Tape System documents.</Paragraph> </Section> </Section> <Section position="11" start_page="32" end_page="37" type="metho"> <SectionTitle> 5 The NASA System has been previously converted to the IS&R format for </SectionTitle> <Paragraph position="0"> more efficient information searching. The titles of a 1000-document corpus were first auto-indexed using IS&R SIMPL programming techniques. The product of the auto-indexing operation is shown in the first column on the left of each page. It consists of a list of content words remaining after the function words were deleted from the title.</Paragraph> <Paragraph position="1"> The middle column is a list of the word combinations created by the NEXUS II program employing linguistic rules and sequence rules.</Paragraph> <Paragraph position="2"> The SEQS column lists the combinations formed by using sequence rules alone. Here every two terms are connected as they occur in syntactical order.</Paragraph> </Section> <Section position="12" start_page="37" end_page="40" type="metho"> <SectionTitle> CORPUS OF NASA TAPE SYSTEM DOCUMENTS - NEXUS: A LINGUISTIC TECHNIQUE FOR PRECOORDINATION, DEC 1968 </SectionTitle> <Paragraph position="0"> [Sample S-C 4020 output pages, only partially legible in this copy. For each title, for example &quot;Magnetic Field Measurements in Interplanetary Space&quot;, &quot;Measurements of Magnetic Properties during the Preheating Phase of a Spindle Cusp Experiment&quot;, &quot;The Effects of the Launch Vehicle on Spacecraft Design&quot;, &quot;Electrical Pulses from Helical and Coaxial Explosive Generators&quot;, &quot;Plasma Compression by Explosively Produced Magnetic Fields&quot; and &quot;Effective Feeding Systems for Pulse Generators&quot;, three columns are listed: the auto-indexed terms, the NEXUS II combinations (e.g., MAGNETIC FIELD; PULSED ELECTRICAL POWER GENERATION; MAGNETICALLY LOADED EXPLOSIVES; ELECTRICAL PULSES; HELICAL COAXIAL EXPLOSIVE GENERATORS; PLASMA COMPRESSION; EXPLOSIVELY PRODUCED MAGNETIC FIELDS), and the SEQS pairings formed by sequence alone (e.g., MEASUREMENTS MAGNETIC; PROPERTIES PREHEATING; PHASE SPINDLE; CUSP EXPERIMENT).]</Paragraph> </Section> </Paper>