NATURAL LANGUAGE TEXT SEGMENTATION TECHNIQUES APPLIED TO THE 
AUTOMATIC COMPILATION OF PRINTED SUBJECT INDEXES AND FOR ONLINE DATABASE ACCESS 
G. Vladu=z 
Institute for Scientific Information 
3501 Market S~reet, Philadelphia, Pennsylvania 19104 USA 
ABSTRACT 
The nature of the problem and earlier approaches 
to the automatic compilation of printed subject 
indexes are reviewed and illustrated. A simple 
method is described for the de~ection of semantically 
self-contained word phrase segments in title-like 
texts. The method is based on a predetermined list 
of acceptable types of nominative syntactic patterns 
which can be recognized using a small domain-indepen- 
dent dictionary. The transformation of the de~ected 
word phrases into subject index records is described. 
The records are used for ~he compilation of Key Word 
Phrase subJec= indexes (K~PSI). The me~hod has been 
successfully tested for the fully automatic 
production of KWPSI-type indexes to titles of 
scientific publications. The usage of KWPSI-type 
display forma~s for the~enhanced online access to 
databases is also discussed. 
i. The problem o~f automatic compilation of 
subject indexes 
Printed subject indexes (SI), such as 
back-of-the-book indexes and indexes to periodicals 
and abstracts journals remain important as the most 
common tools for information retrieval. Tradition- 
ally SI are compiled from subject descriptions 
produced for this purpose by human indexers. Such 
subject descriptions are usually nominalized 
sentences in which the word order is chosen to 
emphasize as theme one of the objects participating 
in the description; the corresponding word or word 
phrase is placed at the beginning of the nominative 
construction. Furthermore, the nominalized sentence 
is rendered in a specially transformed 
('articulated') way involving the separated by commas 
display of component word phrases together with the 
dominating prepositions; e.g. the sentence 'In lemon 
juice lead (is) determined by atomic absorption 
spectrometry' becomes 'LEMON JUICE, lead 
determination in, by atomic absorption spectroscopy.' 
Such rendering enhances the speedy understanding of 
the descriptions when browsing the index. At the 
same time it creates for the subject ~escription 
a llneary ordered sequence of focuses which can be 
used for the hierarchical multilevel grouping 
of related sets of descriptions. The main focus 
(theme) serves for the grouping of descriptions under 
a corresponding subject heading, the secondary 
focuses make possible the further subdivision of such 
group by subheadings. This is illustrated on the SI 
fragment to "Chemical Abstracts" shown on figure \[. 
.~aat P 1YfO2ts tmmm ~ 
lemtno ~tds of. blSStl3a. 11264~ 
t'omGm, oL bevetq~ 0denuf~lt~n m ~Lttlon to, T X%0b 
~m~. o~ F~mealnello. ltl&%lb ~ ~,~ ~. ....... 
mi~ -t vkmm.~ 
7440* 
$mly. ?~SlSt 
~skn~ Nutt~t 
m tl)mlts ¢~. IfsO4lOe I, ei~r *emalm~ ~ulleo 
• ~ ~.~..-, ~ ~ ~. ~ o,. 
I,,mn~ 
|t& O/. ~ I rqdmtloa mu~ (~11~51 t 
umv~. VlOtlLtOm Ot r ~m~ mvl tn reiauom m, 1707~ 
edh~im ~ot. e~|y rlmms II, p tf~126s 
¢leamn| ¢omompm. (or. P ."O7&~k~ 
¢a~unp o~Ue op~&cal/or tmpmv~ abral~on 
• arid t mmis~*~. P ~i9~1~ 
ICTV J I¢"11 yClCJ~/I II~ll~¢rvl4 t~.~ln vl p~erg*l~h,.Ion e poiynwr 
hydras,it fo~. P t32291u 
cJlonlnl fl m4km~ ~ns (or. l~3s 
Ck, qnln 4 s, dns ~or, p~Jtd~4~ in. P 120913k 
tvurd. ~htaott~ lel ~llnl..l fo~. P ~2~1~ 
hvdm~hlh¢ illl. entlblOtlf telltale \[~ I0~%37 O 
hvOmpmh, m~vmem u P IO330~v 
Figure 1 
Fragment of a subject index of traditional type to 
"Chemical Abstracts," compiled from subject descrip- 
tions by human indexers. A text processing problem, 
studied in connection with the compilation of such si 
of traditional type, was the automatic transformation 
of subject descriptions for selecting their different 
possible themes and focuses (Armitage, 1967). An 
experimental procedure, not yet implemented, takes as 
input pre-edited subject descriptions (Cohen, 1976). 
Since the generauion of subjec= descrip=ions by 
human indexers is a very expensive procedure P. Luhn 
(1959) of IBM has suggested replacing subject de- 
scriptions by titles provided by the publication's 
authors. Using only a 'negative' dictionary of high 
frequency words excluded from indexing, he designed a 
procedure for the automatic compilation of listings 
where fragments of titles are displayed repeateuly 
136 
for all their lndexable words, These words are 
alphabetized and displayed on the printed page in the 
central position of a column; their contextual 
fragments are sorted according to the right-hand side 
COntextS of the index words. Such listings, called 
Key-Word-ln-Context (ENIC) indexes, have been 
produced and successfully marketed since 1960 
"quick-a~d-dirty" $I, despite their 'mechanical' 
appearance which makes them difficult enough to read 
and browse. A fragment of KNIC index to "Biological 
Abstracts," featuring titles enriched by addition~l 
key words is shown in figure 2. 
~o~c 
o~/41~ ~t awn~we ~ 
Figure 2. 
Fragment of a Key-Nord-in-Context (KNIt) type 
subject index to "~iological Abstracts," 
automatically compiled from titles of 
biological publicaclons, The blank sparse 
replace the repeated occurrences of ~he key 
word appearing above. 
Another mechanically compiled SI substitute 
still in current use is based on a similar idea and 
simply groups together all the titles concaining a 
same indexable word. Such Key-Word-out-of-Context 
(~WOC) indexes display the full texts of titles under 
a common beading. Figure 3 shows a KWOC sample 
generated from title-Like subject descriptions ac the 
Institute for Scientific Information, The appearance 
of KWOC indexes is more steep,abe but their bro~elng 
is much hindered by the lack of articulation of the 
Lengthy subject descriptions (titles). Without 
proper articulation, the recognition of the context 
immediately relevant to the index word becomes too 
slow. 
In 1966 the Institute for Scientific Information 
(ISI) introduced a different type of automatically 
compiled subject index called PERMUTERM Subject Index 
(PSI) OGarfieid, 19760, which at present is the main 
type of SI to the Science Cltation Index and other 
similar ISI publications. Two different negative 
dictionaries are used for producing ~his SI: a so 
called "full stop list" of words excluded from 
becoming headings as well as from being used as 
subordinate index entries, and a "semi- stop List" of 
~ords of little informative value, which are noC 
allowed as headings but are used as index entries 
along with words ~ound neither in the full-stop uor 
in ~he semi-stop Lists. ~n the PSI every word 
Co-occuring wi~h the heading word in some 
HYORO~E 
Figure 3. 
IMMUNOPRECI~TATION 
~ ~e '~weamoge~a.l,,e,J~l va~ 
¢ 
Fragment of a KNOC index compiled from relatively 
long sub3ect descriptions° The words priuted in 
lowerc~s letters are "stop words," no~ used as inde~ 
headings. 
• ubJect description (tlCle) becomes an entry line 
subordlu~ted to this heading, The format of ~he PSI 
is illustrated in figure 4. 
AOE~OCAmCl ADENOMAS 
• - ~,~ 
4 
Figure 4. 
Fragment of a PERMUTERM subject index to the 
Science Citation Index (Institute for Scientific 
Information), automatically compiled from ~it±es. 
The index lines are words co-occucing in ci~les wk~ 
the heading ,#ord. The arrows indicate the first 
occurrence under the given heading of a pointer to 
given article. 
137 
PSI has the unique ability to make possible the 
easy retrieval of all titles containing any given 
pair of informative words. This ability is similar 
to the ability of computerized online search systems 
to retrieve titles by any boolean combination of 
search terms. The corresponding PSI ability is 
available to PSI users who have been instructed about 
the principles used for compiling it. The naive user 
is more likely to utilize it as a browsing tool. 
When doing so, he may be inclined to perceive the 
subordinate word entries as being the immediate 
context of the headings. Used as a browsing tool, 
PSI may deliver relatively high percentage of false 
drops because of the lack of contextual information. 
Another shortcoming of the PSI is its relatively high 
cost due to its significant size which is propor- 
tional to the square of the average length of titles. 
The large number of entries subordinated to headings 
which are words of relatively high frequency makes 
the exhaustive scanning of entries under such 
headings a time consumLing procedure. 
An important advantage of all the above computer 
generated indexes over their manually compiled 
counterparts is the speed and essentially lower cost 
at which they are made available. 
All the above compilation procedures are based 
exclusively on the most trivial facts concerning the 
syntaxis and semantics of natural languages. They 
make use of the fact that texts are built of words, 
of the existence of words having purely syntactic 
functions and of the existence of lexlcal units of 
very little informative value. A common disadvantage 
versus the SI of traditional type is that the above 
procedures fall to provide articulated contexts 
which would be short enough and structurally simple 
enough to be easily 8rasped in the course of 
browsing. 
Certainly this problem can be solved by any 
systen which can perform the full syntactic analysis 
of titles or similar kinds of subject descriptions. 
From the syntactic tree of the title a brief 
articulated context can be produced for any given 
word of a title by detecting a subtree of suitable 
size which includes the given word. However, in the 
majority of cases the practical conditions of 
application of index compilation procedures are 
excluding the usage of full scale syntactic analysis, 
based on dictionaries containing the required 
morphological, syntactic and semantic information for 
all the lexical units of the processed input. For 
instance, ISI is processing annually for its mutli- 
disciplinary publications around 700,000 titles 
ranging in their subject orientation from science and 
technology to arts and humanities. The effort needed 
for the creation and maintenance of dictionaries 
covering several hundred thousands entries with a 
high ratio of appearance of new words would be 
excessive. Therefore, the automatic compilation of 
SI is practially feasible only on the basis of quite 
simplistic procedures based on "negative" dic- 
tionaries involving approximative methods of analysis 
which yield good results in ~he majority of cases, 
but are robust enough not to break down even in 
difficult cases. 
At one end of the range of problems involving 
natural language processsing are such as question 
answering which require a high degree of analytic 
sophistication and are based on a significant amount 
of domain dependent information formated in bulky 
lexicons. Such procedures appear to be applicable to 
texts dealing with rather narrow fields of knowledge 
in the same way as the high levels of iu-depth human 
expertise are usually limited to specific domains. 
On the other end of the spectrum are simple problems 
requiring much less domain dependent information and 
relatively low levels of "intelligence" (oefined as 
the ability to discuss comprehensive texts from 
gibberish); the corresponding procedures are usually 
applicable to wide categories of texts. For reasons 
explained above, we consider the problems of 
automatic compilation of subject indexes as belongin~ 
to this low end of the spectrum. 
2. The automatic compilation of Key-Word- 
Phrase Sub~ect Indexes 
In this framework we developed an automatic 
procedure for the compilaclon of a SI based on :he 
detection and usage of word phrases. The earlier 
stages of development of this Key-Word-Phrase subject 
index (KW?SI) have been reported elsewhere (Vladutz 
19795. The procedure starts by detecting certain 
types of syntactically self-contained segments of the 
input text; such segments are expected to be 
semantically self-contained in view of the assumed 
well-formedness of the input. The segment detection 
procedure is based on a relatively short list of 
acceptable syntactic patterns, formulated in terms of 
markers attributable by a simple dictionary look-up. 
The markers are essentially ~he same as used in 
(Klein 1963) in the early days of machine translation 
for automatic grammatical coding of English words. 
All the words not found in an exlusion dictionary of 
"~ 1,500 words are assigned the two markers ADJ 
and NOUN. All the acceptable syntactic patterns are 
characterized in the frameworks of a generative 
gr=--,~r constructed for title-type texts. Sucn texts 
are described as sequences of segments of acceptable 
syntactic patterns separated by arbitrary filler 
segments whose syntactic pattern is different from 
the acceptable ones. The analysis procedure leading 
to the detection of acceptable segments was 
formulated as a reversal of the generative grammar 
and is performed by a right =o left scanning. .~ew 
acceptable syntactic patterns can easily be 
incorporated into the generative grammar. It is 
envisaged to use in the future existing programs for 
automatically generating analysis programs from any 
specific variant of the grammar. 
The present list of acceptable syntactic 
patterns includes such patterns where noun 
phrases are concatenated by the preposition 'OF' anu 
the conjunctions 'AND', 'OR', 'AND/OR', as well as 
constructions of the type 'NPI, NP2, ... AND 
NPi'. Since no prepositions other than 'OF' and no 
conjunctions other than 'AND', 'OR', 'AND/OR' can 
occur in the acceptable segments the occurrences of 
other prepositions and conjunctions are used for 
initial delimitation of acceptable segments, but 
the detection procedure is not limited to such usage. 
In particular, a past participle or a group 
containing adve-bs followed by a past participle are 
excluded from the acceptable segment when preceuing 
138 
an initial delimiter. The segmentation detection is 
illustrated for three titles in figure 5. 
A) SPREADING OF VIRUS INFECTION among WILD 
BIRDS And MONKEYS during INFLUENZA 
EPIDEMIC C a u s • d by VICTORIA(3~75 
V~L~Wr OF,,,A(-~N2} virus 
B) EXERCISE I u d u c • d CHANGES in LEFT- 
VENTR/CULAR Function in Patients ~rlth 
MITRALE-VALVE PROLAPSE 
C) DIFFERENTIATION OF MLC-INDUCED KILLER And 
Suppresser TTCELS by SENSITIVITY to 
FTRILAMINE 
Figure 5. 
The detection of acceptable segments is shown for 3 
titles. The words with all lowercase letters are 
prepositions and conjunctions used as initial 
deli~Liters. The words with only initial capital 
letters are "seml-stop" words, excluded from being 
used as index headings; the underscored by dotted 
lines "seml-stops" are past participles which become 
dellminters only when followed by initial dellmi~ars. 
The resulting multl-word phrases are underscored 
~wice unlike the resultlng single word phrases which 
are underscored once. 
The first part of the system's dictionary con- 
Junctions, prepositions, articles, auxiliary verbs 
and pronouns. Th/s part is completely domain 
independent. A second part of the dictionary 
consists of nouns, adjectives, verbs, present and 
past participles, all of them of little informative 
value and, therefore, called "seml-stop" words. Such 
words will not be allowtd later to become SI 
headings. The semi-stop par~ of the dictionary is 
somewhat domain-dependent and has to be atuned for 
different broad fields of knowledge such as science 
and technology, social sciences or arts and 
humanities. 
The second logical step in the SI compilation 
involves the transformation of acceptable segments 
into index records consisting of an informative word 
(not found in the system's dictionary) displayed as 
heading llne and of an index llne providing some 
relevant context for the headlng word. Each multi- 
word segment generates as many index records as many 
informative words it contains. The ri~ht-hand side 
of the segment following the heading word is placed 
at the beginning of the index line to serve as its 
i~nediate context and is followed through a senLicolon 
by the segment's left-hand side. When both sides are 
non-empty, an articulation of the index line is so 
achieved. In the case of a single word segment an 
"expansion" procedure is performed during index 
record generation. It starts by placing at the 
beginning of the index llne a fragment of the title 
consisting of the filler portion following the 
heading word and of the next acceptable segment, if 
any; this initial portion of the index line is 
followed by a semicolon after which follows the 
preceding acceptable segment, followed finally by the 
filler portion separating in the title this preceding 
segment from the heading word. The index record 
generation is illustrated in figure 6. 
139 
final "enrichment" phase of the index 
record generatlon involves the additional 
display (in parenthesis) of the unused segments 
of the processed title. 
*SPREADING 
of VIRUS INFECTION 
*VIRUS 
INFECTION; SPREADING of * 
INFECTION 
SPREADING OF VIRUS * 
*WILD 
BIRDS and MONKEYS 
*BIRDS 
and MONKEYS; WILD * 
*MONKEYS 
WILD BIRDS and * 
*INFLUENZA 
EPIDEMIC 
*EPIDEMIC 
INFLUENZA * 
*SENSITIVITY 
to FYR/LAMINE; DIFFERENTIATION of 
MLC-INDUCED KILLER SUPRESSOR T-CELLS by * 
*PYRILAMINE 
SENSITIVITY ~o * 
Figure 6. 
The transformation of Key-Word-Phrases into 
subJec~ headlngs and subject entries is 
illmstrated for the first two seFjnents of the title 
A, Figure 6. The last two examples snow how single 
word segments (from Title C) are expanded to incluoe 
the preceding and following them segemencs. 
As a result of this stage the informacioual value of 
the finally generated index record is almost 
equivalent to the information content of ~ne initial 
full title. The entire process ultimately boils down 
to the the reshuffling of some component segments of 
the initial title. The enric~ent stage of index 
record generation is illustrated on figure 7. 
The index records are alphabetized firstly 
by heading words and secondly by index lines 
with the exclusion from alphabetiza~ion of 
prepositions and conjunctions if they occur aC the 
beginning of index lines. During the 
photocomposition different parts of the index 
line are set usin~ different fonts. If in the 
original title the initial part of the index 
llne follows the head word i~nedlately this 
part is set in bold face italics, i.e. in the 
same font as the heading. The "inverted" part 
following the semicolon is set in light face roman 
letters. Finally the enrichment part of ~he index 
line, included in patens is always displayed in 
light-face italics. As a result the 
*SPREADING 
of VIRUS INFECTION (WILD BIRDS and MONKEYS; 
INFLUENZA EPIDEMIC; VICTORIA(#)75 VAKIANT 
of A(K3N2) VIRUS) 
*VIRUS 
INFECTION; SPREADING OF * (WILD BIRDS and 
MONKEYS; INFLUENZA EPIDEMIC; VICTORIA(3)75 
VARIANT of A(H3N2) VIRUS) 
*BIRDS 
and MONKEYS; WILD * (SPREADING of VIRUS 
INFECTIONS; INFLUENZA EPIDEMIC, EPIDEMIC; 
VICTORIA(3)75 VARIANT of A(H3N2) VIRUS) 
*INFLUENZA 
EPIDEMIC (SPREADING of VIRUS INFECTIONS; 
VICTORIA(3)75 VARIANT of A(H3N2) VIRUS) 
*PYRILAMINE 
SENSITIVITY to * (DIFFERENTIATION of MLC- 
INDUCED KILLER and SUPPRESSOR T-CELS) 
Figure 7 
The enrichment of the subject entries by the display 
(in parenthesis) of the unused by them segments of 
the same title, illustrated for some of the entries 
of Figure 6. 
immediately relevan~ coutext of the head word is 
displayed in bold face in order to facilitate its 
rapid grasping when browsing. Details of the 
appearance and s~ructure of KWPSI are exemplified in 
figure 8 on a sample compiled for titles of publica- 
tions dealing with librarianship and information 
science. The general appearance of KWPSI is close 
enough to the appearance of SI of traditional type. 
For purposes of transportability the KWPSI 
system is programmed in ANSA COBOL. It includes two 
modules: the index generation module and the sorting 
and reformatting module. On an IBM 370 system index 
records are generated for titles of scholarly papers 
at a speed of ~ 70,000 titles/hour. The resulting 
total size of the index is of the same order as the 
size of KWOC indexes and compares favorably with the 
size of the PSI index. 
The analysis of ~he rates and ~aCure of failures 
of ~he segment detection algorithm shows that 
in 96% of cases the generated segments are fully 
acceptable as valuable index entries. In 2% of 
cases some important information is lost as a result 
of the elimination of prepositions, as in case of 
expressions of 'wood to wood' type. The rest of 
failures results in somehow awkward segments which 
are not completely semantically self-contained. Even 
in such cases the index entries retain some 
informative value. Around half of the failures can 
be eliminated by additions to the system's 
dictionary, especially by the inclusion of more verbs 
and past participles. Not counted aS failures are 
the 5% of cases when the leng=h of the detected 
segments is excessive; such segments can include the 
whole title. 
The extent of tuning required for =he 
application of the system in a new area of knowledge 
depends mainly upon ~he extent of figure 8. 
*INDEXING 
• 
KSNIE~CML) ...................... IC-80-2.SS7 CONV\[~NC~ md COMP&IIgRJI~ ~T1k~ ~ M.L- 
UmO~-IIOOR-CHAMIUER u~ MEI~.AL 
SUlIJ\[CT * ......................... IC-IiO-2-SQ4* 
¢I~Vl/E/~IACZ.. ANA,%VSIS Of SUIJECT • (USE 
CENTRAI.~TO $UIJECT WOEX~ ~RF¢~. 
OF. tW~.SOC~.S~E~S.4C4OEM~, CENtRe. 
CoM~rTEE,OF.T~.CPSU) ............. K:.110-2.SII3 
w ~JR~eWT 4 WAn~Wt$J' sfnva~l.. ~J~TK:UI.ATT0 
SUBJECT " ¢COMPAMSON.I ~ ~ ~O~ ~X~ 
L~/~mv STtOES) ................... IC-10- 2 .Sll t 
OAVAIIalur * .......................... IC-IIO-2.SIO 
m'DOCUM£/W~" SUIRCT " flM4LJ~I;',O~ ~' 
k, M TNEMA TICAI. MOOEL$) .............. IC-S0-2-$66 
et F,A~ r S¢~Nn~c ~'It~OtC4LS (INOEX c,4 rALOGUE) 
IC.eG2-SB2 
£Jr~q~'M~,: I~POeT o~ * (BOOK ,~O~'X~NG. tOOL 
.qE,4~NG. RJr'$u.~'O.~o~qO. 44TAC~r. text') 
~C.SO-2-SI L 
FAILMRE; STUDY O~ * t'TNE,~I~D AUrOA44 I?C 
/NOEXlNG) ......................... IC-80-2.S67 
MAW IdiAN-PA~IFI¢ * .................... LC-80.2-582 
KWOC aml PRI\[CIS * (ARTICULATED $U&IECT INOEXING 
CUR#ENT AW.¢REN~$,$ .~YIC£$. COM~A~$O~ LIBRARY 
$17JD~E$) .......................... IC.80.2-Sl ~ 
LANGUAGES PA/'7~WE~t I~DUNO,~NT • (N4ruRAL 
LANGUAG£) ........................ IC;.80-2-S60 
~$.' FOUNOATION of • (~IA.~qCA T/ON/=R~.~$~) 
IC.S0.2.~67 
LAW FNFOR~I~'NT ,~d CRR~WAL JUSlIC~ tNP~flM4TION 
(NAnONAI. CRtMINAI. JU$1"/C~ Tt'IE&4UltUS, 
DL;CIt~TORS) ...................... IC-80-2.SSO 
MACNmE.AIO\[D * (EN~UNtTI~. ~UTOI44 ;1C ~VO~XtNG ol 
3RD KIND) ......................... IC-I10-2-S67 ~. 
R~FERIENC\[ STRING " ............ IC.80-2-$8 \] 
PAleR ~ and SYSTEM e~ • (APP\[IC~T~IV ol 
COMPUTERS. INI7tOOUOIVG. ;'W~.,~IU~Jb, 8UI.L£TIN) 
IC-80-2-S61 
PI'R~. US~ of C\[NTRALI~q\[D SUBJECT * (LIgRAIW,OF- 
THE,$OCIA£.SCIENC£$~ICADEM~ CENrRAL.COMMII'7"EE- 
OF. T~,~.CPSU: Alva. }'$/$ ot ,~JgJECT tNO~X/NG 
CONVERGENCE) ..................... IC-80-2-563 
m,e PEle~OtC4L UTEIt,4rt, M~ e¢.4~'~O.4M£~C4*t 
m~GRAP~r ...................... ~C.80.2,S82 
ef I~WINITO OOCUMFNY~" C~81NATION of CLASS~F~CATIO~ 
an4 SUEJECT " ...................... ~C-80.2,S6S 
~£$$." QUM~TITATIVE 0($CRIPTION of " . . . IC.80-2,S65 
w R~$~AiI~N (APPLIC, ATIO~V of INFORfvMTION SCIENCE 
RECOGNITIONS. SPECIAL SCIENCES. #VFO~M/lrloN 
SYSTEMS. LARGE.C.4PAC/rY $TO~/IGE) .... IC.80-2.SS9 
4~d RErR~tEVAIL; iNTELLIGENT * (MAN-MACHINE 
P/IRTNER$t'fll "1) ...................... IC-80.2.S6S 
ROLE of AUTOMATI~ * (O~RArlONAL ON&INE RETRfEVAL 
S ~S ffM$) .......................... ~C-80-21S 54 
ITtlK * (M~CROCOMPUTER.G~NEIM ITD GR~IP~C 
D/SPLAYS. /liD) ..................... IC.80-2-S80 
SURVEY M AUSTRALIAN * (O~RV/IT;ON$) . , IC-80.2.584 
$~r.~M: COMPUTF.R-ASSlST\[0 DYNAMIC * (CAOtS) 
IC.60.2-S67 
sr~'M; THESAURUS ehd * (~RO~LEMS of UNIVEI~S/I"Y 
O~G/INI!'/IT~ON) ..................... IC-80.2.57S TERMINOLOGY POSTIN~ F~R~IrJ~. DO(: 
~TI~EVAL II~4~ * 
(PIIERA~C/'IV ~nd KWOC) ............... IC,80-2-S48 
THEOR~'Ti~4L FOUNOAnON~" ~%ACTICE of * 
IC-80-2.565 
THE~;AURU$-ILASEO AUTOMATtC * (~FUDV of/N~XING 
\[All. USE) .......................... eC-80.2.S67 
TNESAUIIUS-IIASED OOCUMENT " (RECOGNIrlON of 
MUL T/COM~ONENT TERMS) ............ IC.80"~.$76 
F/O~: PRIOeLEMS Of COORO(NATE * (NETWOR~ O/ 
AuTOMA rED 5rl CENT, S) ............ ~C-80.2 58A 
V~RSUS RWOC AUTOMATED * (P~R~O~MANCE 
C(:~I~ARtSOIV) .................... IC.80"2 568 
I~CAmULA~Y:FREE " ................. IC,80,2 S66 
~ST-~OOO: TI~NOS ,n * ........... IC 80-~.553 
~f JRO RINO: AUTOMATIC * (/tAUICI~'YNE./llOED tN~ AING 
ENCOUNrE~P) .................... IC-80" 7 567 
~NOUSTRIAI. 
ENrF~II~tS~S rUSE o~ P/l rENT IN~O~,~ r/ON e,.~ 
/NrERNAYI~NA~ PARENT CLA$SIFICATK:~I). . IC-80.2 572 
INFORMA 7~N rPt£~AURfJ$ (CENTRAL ,AMEIP, C~ Br~ 
DOM~N~/IN.R~PUB.~C) .............. ~C-80.2 sso 
Figure 8 
A photocomposed k'WPSI sample showing details of i~s 
structure and appearance. 
deviations from the normal structure of natural 
language texts occurring in the new file. As a 
matter of fact all kinds of scholarly titles contain 
such deviations, as for instance portions of normal 
text included in parentheses or occurrences of 
mathematical or chemical symbols. We found only one 
case when the required tuning effort was siguifican~, 
namely the case of titles from ~he domain of arts and 
humanities. ISl's "Arts and Humanities Citation 
140 
Index" includes besides ci~les of arclcles, also 
cicles of book reviews, as well as descriptions of 
musical performances and musical records, compiled 
accordin G co special rules. Many contain mulciword 
names of works of arC, cakes in quotation marks, 
which have ~o be handled as single words. A KWPSZ 
sample for ar~s and humaniCies is shown on figure 9 • 
Two more KWPSI samples are given on figure 4~ 
(science and Cect~ology ciCles) and figure 11 
(&eoscience research f=on~ names). 
DRAMA 
"~~ ..... 
~ ~:::::::::::::::;~,~------~.~, 
• " .::::::::::::::. 
~ #e,l~¢m~lw m emllum~ • ~u 
allmln~ 
DRAWING 
• . ......... :~:~ 
,,'~"~ F...':~--".. ~.~.~"~'~ ' 
~:::::::::~"~ 
~-,~__-...."~....,....~...::::~,~: 
• ,,,,~ ......................... ,m,.. . ~,~"-- ,,~'=~'P,.im ~ .~ ~'" 
~mim mmi~ ** n~Itmt~im...~u~mlc 
me~ii, ii--i~cj~,~m; mmm., ,~~ ° ~" .............. " .............. ~1~1 
mmml-. ........... m. 
............................ xp~m~ 
......................... m 
,m~mm m~i. . ................. ~ns~ s 
• .~ .., 
ii~p ~ m r~f w ~q~nl~ ~ ~tr 
a*Al.,z~r 
oea~i~n~r~ 
tnap~ .~ a~L~ ammm • aeo,~ov ~ 
~lqI~ ~ ~ ~MJ~mP g~a~ma m 
dM~ 
,,' ~IU~ ~llAm~lanl - .............. umaqlll • 
'Oallr~lol, IR. Omp" 
,.alwim.* 
~1,~ ........................ ,,..sP.,I J, 
• .. ...... ;!!!!!!!!! 
' : :::::::::?:!!!i!!!iii!! : 
Figure 
L,,i}~i samples for arts and humanities titles. 
eIORD£RUNE "CANCER 
.......... --"L~-j.-~ 
• ....... w, ............... , 
I 
m 
,%'~'~ ,,,~,,'~......~ ~, 
.'~,~ u~...~....,...,~,, ~ 
~(~amm a Og*iUL • ~Jq~,r ~wsm e 
m nadar miw~I~m~ ~mrs~4 ~, e 
~-%:~,~ --.~E~2~, 
q~4ae~ 
• w 
Figure I0 
K~PSI sample for clclea covered by the mulci- 
dlsciplinaz7 SCI (science and cechnology). 
3. Possible usage o_ff automatically detected 
vord phrases for enhanced online access co 
(Lacabeses 
The common method of online access to 
commercially available textual databases of both 
bibliographic and full cexc type is through boolean 
queries formulaCsd in tern~s of single words. 
AuCo~Icically detectable word phrases of the cype 
used in the KWP$1 sysCem could be used in ~L~ree 
different rays for improving online access. 
One extreme way would involve the creation of 
word phrases of the above cype a= the input stage for 
every informative word of the Input. In response to 
a slnsle word query a sequence of screens would be 
shown displaying the image of whaC in a printed KWPSl 
would be the KWPSI section under ~he Given word ta~eL~ 
as headin G . After browsing online some par~ o~ tni~ 
up-to-dace online 51 the user could choose to limi~ 
further browsin G by respondln~ wits an additional 
search term, mos~ likely chosen from some of t~le 
already examined index entries. As a result ~he 
system would reply by ellmlna~ing from ~ne displayed 
output the encrlas aoc concalnin 8 che given word an~ 
the user would conclnue co browse che so ~rimmed 
display. Several such i~eratlons could be performe~ 
141 
ATMOSPHERE 
c..* *z s~ ii ~*ll 
IIi- ~RIIIII e# IFI I.~i .~I$ii 
~ II~R~ll~ m ! • , ........ ii.iii i 
,iI~.~ l~li'l t lll¢14&IMlm1,1 1 ~ I l .I l i ~ 
'~,~,~'~ ".._._ ~"~"*~u~ ..' ' "m .'2 ~'~" 
ml~.lllm llll,mu¢ clmul~.im mll~j ,.,~ ¢l'l~Iz ,,mm,¢~Im ~ aclM .lWl,ml~n .,I ~ - 
ll.Olll 
|1.~ 
|H*Z~ 
llllUl" r'~C~l fPWl,lll i~ ~ FI~I~ . ll-OllW 
ATMOSPHERIC 
Koo ~r~l ........ nL.llgl 
iL.o;*| 
ammummm r ! e,*¢~ o,, a o~,,~cmo, a~'m~ ,,m 
ll,opt.i 
.,., ilu~le,l, al m ,,m~m~l ilmlmll. 
I i-o¢III 
o,1~111~ I~/l~llC 7/o~ e/#1~4 GG/NI) .......... ll.llll 
~ ~lllt~ - . ................. e 1.111@ 
~1,~1 ! ~ i&~l - ll,illl 
....................... Ill| 
- .,,.. • 
I14)I01 
/lllfll I II11~I I I# I ............. II.IIII 
,,,u*ll i, clll. ~,~ll.m~. - ..... II ~101 
Figure Ii 
KW'PSI sample for names of geoscience research 
fronts. 
until the user would be left with the display of a SI 
to relevant items of the database. This SI would be 
then printed together with the full list of relevant 
items. It is ~hought that such kind of interaction 
could be more user friendly than the currently used 
boolean mode. 
Another way of using the KWPSI technique 
in an online environment would be to use the 
KWPSI format for the output of the results of a 
retrieval performed in a traditional boolean 
way. The query word which achieved the most 
strong trimming effect would be used as 
heading. 
A third way would involve the compression of a 
KWFSI section under a given heading before it is 
displayed in response to a word. One could e.g. 
retain only such noun phrases containing the given 
word which occur at least k times in the database. 
An example of such list for the "Arts and Humanities" 
database is given in figure 12. By displaying such 
lists of words closely co- occurring with a given one 
~he system would perform thesaurus-type functions. 
The implementation of all such 
possibilities would be rather difficult for any 
existin 8 system in view of the effort required 
to reprocess past input. Instead, after the input of 
a query word the corresponding full text records 
could be called a not processed online in core for 
generating KWPSl-type index records. In this case 
all the above functions could be still performed. 
AFRICAN * HORROR(S) * 
AMERICAN * iNDUSTRY 
ARCHIVES LITERARY * 
AVANT-GARDE * MAKER(S) 
BLACK * OBSCENE * 
BRAZILIAN * POLISH * 
COMPANY PRODDCTION 
CRITICISM PROPAGANDA 
DIRECTOR TECHNIQUES 
DOCUMENTARY * THEORY 
FESTIVAL(S) TV * 
HISTORY WESTERN * 
Figure 12 
Lie= of word phrases containing the word 
'FILM(S)' and appearing at least twice in 
titles covered during a three months period by 
\[SI's ARTS and HUMANITIES CITATION INDEA. Such 
automatically compiled lists are suggested as 
search aides for the online access to databases. 
Another possibility which we are considering is 
to place the K~PSI-type processing capabilities into 
a microcomputer which is being used to mediate online 
searches in remote databases. All the text records 
containin8 a given (not too frequent) word coulu be 
initially tapped from the database into the micro- 
computer. Following that the microcomputer could 
perform all the above functions in an offline mode, 

REFERENCES 

Armltage, J.E., Lynch, M.F. "Articulation 
in the Generation of Subject Indexes by 
Computer." Journal of Chemical Documentation: 
7:170-8, 1967. 

Cohen, S.M., Dayton, D.L., Salvador, R. 
"Experimental Algorithmic Generation of 
Articulated Index Entries from Natural Language 
Phrases a= Chemical Abstracts Service." 
Journal of Chemigallnformation and Computer 
Sciences: 1976 May, 16(2): 93-99. 

Garfield, E. "The Permuterm Subject 
Index: An Autobiographic Review." Journal of 
the Amer_ica_n Society fqr Information Science: 
27(5/6): 288-291, 1976. 

Klein S., Simmons R.F. "A Computational 
Approach to grammatical coding of English 
words." Journal of ACM: 10(3):334-347, 1963. 

Luhn P. "Keyword-in-Con=ext Index for Technical 
Literature." Report RC 127. New York: IBM Corp., 
Advanced System Development Division, 1959. 

Vladutz G., Garfield E. "k-WPSI - An 
Algorithmically derived Key Word/Phrase Subject 
Index." Proceedings of the ASIS; 42nd Annual 
MeetinG, Minneapolis, ~nnesota, October 14-18, 
1979; pp. 236-245. 
