<?xml version="1.0" standalone="yes"?> <Paper uid="A83-1023"> <Title>AFRICAN * HORROR(S) * AMERICAN * INDUSTRY ARCHIVES LITERARY * AVANT-GARDE * MAKER(S) BLACK * OBSCENE * BRAZILIAN * POLISH * COMPANY PRODUCTION CRITICISM PROPAGANDA DIRECTOR TECHNIQUES DOCUMENTARY * THEORY FESTIVAL(S) TV * HISTORY WESTERN *</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> NATURAL LANGUAGE TEXT SEGMENTATION TECHNIQUES APPLIED TO THE AUTOMATIC COMPILATION OF PRINTED SUBJECT INDEXES AND FOR ONLINE DATABASE ACCESS G. Vladutz </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="141" type="metho"> <SectionTitle> ABSTRACT </SectionTitle> <Paragraph position="0"> The nature of the problem and earlier approaches to the automatic compilation of printed subject indexes are reviewed and illustrated. A simple method is described for the detection of semantically self-contained word phrase segments in title-like texts. The method is based on a predetermined list of acceptable types of nominative syntactic patterns which can be recognized using a small domain-independent dictionary. The transformation of the detected word phrases into subject index records is described.</Paragraph> <Paragraph position="1"> The records are used for the compilation of Key Word Phrase subject indexes (KWPSI). The method has been successfully tested for the fully automatic production of KWPSI-type indexes to titles of scientific publications. The usage of KWPSI-type display formats for the enhanced online access to databases is also discussed.</Paragraph> <Paragraph position="2"> 1. The problem of automatic compilation of subject indexes Printed subject indexes (SI), such as back-of-the-book indexes and indexes to periodicals and abstract journals, remain important as the most common tools for information retrieval. Traditionally, SI are compiled from subject descriptions produced for this purpose by human indexers. 
Such subject descriptions are usually nominalized sentences in which the word order is chosen to emphasize as theme one of the objects participating in the description; the corresponding word or word phrase is placed at the beginning of the nominative construction. Furthermore, the nominalized sentence is rendered in a specially transformed ('articulated') way, in which the component word phrases are displayed separated by commas, together with their dominating prepositions; e.g. the sentence 'In lemon juice lead (is) determined by atomic absorption spectrometry' becomes 'LEMON JUICE, lead determination in, by atomic absorption spectroscopy.' Such rendering speeds up the understanding of the descriptions when browsing the index. At the same time it creates for the subject description a linearly ordered sequence of focuses which can be used for the hierarchical multilevel grouping of related sets of descriptions. The main focus (theme) serves for the grouping of descriptions under a corresponding subject heading; the secondary focuses make possible the further subdivision of such a group by subheadings. This is illustrated by the SI fragment to &quot;Chemical Abstracts&quot; shown in figure 1.</Paragraph> <Paragraph position="4"> Figure 1. Fragment of a subject index of traditional type to &quot;Chemical Abstracts,&quot; compiled from subject descriptions by human indexers. 
A text processing problem studied in connection with the compilation of such SI of traditional type was the automatic transformation of subject descriptions for selecting their different possible themes and focuses (Armitage, 1967). An experimental procedure, not yet implemented, takes as input pre-edited subject descriptions (Cohen, 1976). Since the generation of subject descriptions by human indexers is a very expensive procedure, H. P. Luhn (1959) of IBM suggested replacing subject descriptions by titles provided by the publication's authors. Using only a 'negative' dictionary of high frequency words excluded from indexing, he designed a procedure for the automatic compilation of listings where fragments of titles are displayed repeatedly for all their indexable words. These words are alphabetized and displayed on the printed page in the central position of a column; their contextual fragments are sorted according to the right-hand side contexts of the index words. Such listings, called Key-Word-In-Context (KWIC) indexes, have been produced and successfully marketed since 1960 as &quot;quick-and-dirty&quot; SI, despite their 'mechanical' appearance, which makes them rather difficult to read and browse. A fragment of a KWIC index to &quot;Biological Abstracts,&quot; featuring titles enriched by additional key words, is shown in figure 2.</Paragraph> <Paragraph position="6"> Figure 2. Fragment of a KWIC index, automatically compiled from titles of biological publications. The blank spaces replace the repeated occurrences of the key word appearing above.</Paragraph> <Paragraph position="7"> Another mechanically compiled SI substitute still in current use is based on a similar idea and simply groups together all the titles containing the same indexable word. Such Key-Word-Out-of-Context (KWOC) indexes display the full texts of titles under a common heading. 
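The key-word-in-context principle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop list, the column width and the function name kwic_lines are hypothetical and are not taken from Luhn's procedure or from any production KWIC system.

```python
# Hypothetical 'negative' dictionary; a real stop list would be much larger.
STOP_WORDS = {"a", "an", "and", "by", "for", "in", "of", "on", "the", "to"}


def kwic_lines(titles, width=30):
    """Return one KWIC line per indexable (non-stop) word of every title.

    Lines are sorted by key word and then by right-hand context, and the
    key words are aligned in a central column.
    """
    records = []
    for title in titles:
        words = title.split()
        for i, word in enumerate(words):
            if word.lower() in STOP_WORDS:
                continue
            left = " ".join(words[:i])
            right = " ".join(words[i + 1:])
            records.append((word.lower(), right.lower(), left, word, right))
    records.sort(key=lambda r: (r[0], r[1]))
    lines = []
    for _, _, left, word, right in records:
        # Right-align the (truncated) left context so key words line up.
        left_column = left[-width:].rjust(width)
        lines.append((left_column + "  " + word.upper() + "  " + right[:width]).rstrip())
    return lines


if __name__ == "__main__":
    sample = ["Determination of lead in lemon juice", "Lead poisoning in children"]
    for line in kwic_lines(sample):
        print(line)
```

Grouping the same records by key word and printing the full title under each key word instead of the truncated contexts would give the KWOC layout.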
Figure 3 shows a KWOC sample generated from title-like subject descriptions at the Institute for Scientific Information. The appearance of KWOC indexes is more acceptable, but their browsing is much hindered by the lack of articulation of the lengthy subject descriptions (titles). Without proper articulation, the recognition of the context immediately relevant to the index word becomes too slow.</Paragraph> <Paragraph position="8"> In 1966 the Institute for Scientific Information (ISI) introduced a different type of automatically compiled subject index called PERMUTERM Subject Index (PSI) (Garfield, 1976), which at present is the main type of SI to the Science Citation Index and other similar ISI publications. Two different negative dictionaries are used for producing this SI: a so-called &quot;full-stop list&quot; of words excluded from becoming headings as well as from being used as subordinate index entries, and a &quot;semi-stop list&quot; of words of little informative value, which are not allowed as headings but are used as index entries along with words found neither in the full-stop nor in the semi-stop lists. In the PSI every word co-occurring with the heading word in some subject description (title) becomes an entry line subordinated to this heading.</Paragraph> <Paragraph position="10"> Figure 3. Fragment of a KWOC index compiled from relatively long subject descriptions. The words printed in lowercase letters are &quot;stop words,&quot; not used as index headings.</Paragraph> <Paragraph position="11"> The format of the PSI is illustrated in figure 4.</Paragraph> <Paragraph position="13"> Figure 4. Fragment of a PERMUTERM subject index to the Science Citation Index (Institute for Scientific Information), automatically compiled from titles.</Paragraph> <Paragraph position="14"> The index lines are words co-occurring in titles with the heading word. 
The arrows indicate the first occurrence under the given heading of a pointer to a given article.</Paragraph> <Paragraph position="15"> PSI has the unique ability to make possible the easy retrieval of all titles containing any given pair of informative words. This ability is similar to the ability of computerized online search systems to retrieve titles by any boolean combination of search terms. It is available, however, only to PSI users who have been instructed about the principles used for compiling the index. The naive user is more likely to utilize it as a browsing tool.</Paragraph> <Paragraph position="16"> When doing so, he may be inclined to perceive the subordinate word entries as being the immediate context of the headings. Used as a browsing tool, PSI may deliver a relatively high percentage of false drops because of the lack of contextual information. Another shortcoming of the PSI is its relatively high cost due to its significant size, which is proportional to the square of the average length of titles. The large number of entries subordinated to headings which are words of relatively high frequency makes the exhaustive scanning of entries under such headings a time-consuming procedure.</Paragraph> <Paragraph position="17"> An important advantage of all the above computer generated indexes over their manually compiled counterparts is the speed and substantially lower cost at which they are made available.</Paragraph> <Paragraph position="18"> All the above compilation procedures are based exclusively on the most trivial facts concerning the syntax and semantics of natural languages. They make use of the fact that texts are built of words, of the existence of words having purely syntactic functions and of the existence of lexical units of very little informative value. 
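The pair-retrieval property of the PSI and the quadratic growth of its size, both noted above, can be sketched in Python as follows. The two stop lists and the function names are hypothetical placeholders; the actual ISI lists and programs are not given in the text.

```python
from collections import defaultdict

# Hypothetical stop lists, for illustration only.
FULL_STOP = {"a", "an", "and", "in", "of", "on", "the", "to"}
SEMI_STOP = {"method", "new", "study"}


def permuterm_index(titles):
    """Map every heading word to a set of (co-occurring word, title number) entries."""
    index = defaultdict(set)
    for number, title in enumerate(titles):
        # Full-stop words are dropped entirely.
        words = {w.lower() for w in title.split()} - FULL_STOP
        # Semi-stop words never become headings ...
        for heading in words - SEMI_STOP:
            # ... but they may still appear as subordinate entries.
            for entry in words - {heading}:
                index[heading].add((entry, number))
    return index


def titles_with_pair(index, first, second):
    """Numbers of the titles in which both words occur together."""
    return sorted(n for entry, n in index.get(first, ()) if entry == second)


if __name__ == "__main__":
    sample = ["Lead in lemon juice", "New method of lead determination", "Lemon oil"]
    psi = permuterm_index(sample)
    print(titles_with_pair(psi, "lead", "lemon"))
```

A title with n distinct informative words contributes n(n-1) entries in this sketch, which is the source of the quadratic dependence on title length.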
A common disadvantage versus the SI of traditional type is that the above procedures fail to provide articulated contexts which would be short enough and structurally simple enough to be easily grasped in the course of browsing.</Paragraph> <Paragraph position="19"> Certainly this problem can be solved by any system which can perform the full syntactic analysis of titles or similar kinds of subject descriptions.</Paragraph> <Paragraph position="20"> From the syntactic tree of the title a brief articulated context can be produced for any given word of a title by detecting a subtree of suitable size which includes the given word. However, in the majority of cases the practical conditions of application of index compilation procedures exclude the usage of full scale syntactic analysis, based on dictionaries containing the required morphological, syntactic and semantic information for all the lexical units of the processed input. For instance, ISI processes annually for its multidisciplinary publications around 700,000 titles ranging in their subject orientation from science and technology to arts and humanities. The effort needed for the creation and maintenance of dictionaries covering several hundred thousand entries with a high rate of appearance of new words would be excessive. Therefore, the automatic compilation of SI is practically feasible only on the basis of quite simplistic procedures based on &quot;negative&quot; dictionaries, involving approximative methods of analysis which yield good results in the majority of cases, but are robust enough not to break down even in difficult cases.</Paragraph> <Paragraph position="21"> At one end of the range of problems involving natural language processing are those, such as question answering, which require a high degree of analytic sophistication and are based on a significant amount of domain dependent information formatted in bulky lexicons. 
Such procedures appear to be applicable to texts dealing with rather narrow fields of knowledge in the same way as the high levels of in-depth human expertise are usually limited to specific domains.</Paragraph> <Paragraph position="22"> On the other end of the spectrum are simple problems requiring much less domain dependent information and relatively low levels of &quot;intelligence&quot; (defined as the ability to discern comprehensible texts from gibberish); the corresponding procedures are usually applicable to wide categories of texts. For reasons explained above, we consider the problems of automatic compilation of subject indexes as belonging to this low end of the spectrum.</Paragraph> <Paragraph position="23"> In this framework we developed an automatic procedure for the compilation of an SI based on the detection and usage of word phrases. The earlier stages of development of this Key-Word-Phrase subject index (KWPSI) have been reported elsewhere (Vladutz 1979). The procedure starts by detecting certain types of syntactically self-contained segments of the input text; such segments are expected to be semantically self-contained in view of the assumed well-formedness of the input. The segment detection procedure is based on a relatively short list of acceptable syntactic patterns, formulated in terms of markers attributable by a simple dictionary look-up.</Paragraph> <Paragraph position="24"> The markers are essentially the same as those used in (Klein 1963) in the early days of machine translation for automatic grammatical coding of English words.</Paragraph> <Paragraph position="25"> All the words not found in an exclusion dictionary of approximately 1,500 words are assigned the two markers ADJ and NOUN. All the acceptable syntactic patterns are characterized in the framework of a generative grammar constructed for title-type texts. 
Such texts are described as sequences of segments of acceptable syntactic patterns separated by arbitrary filler segments whose syntactic pattern is different from the acceptable ones. The analysis procedure leading to the detection of acceptable segments was formulated as a reversal of the generative grammar and is performed by a right-to-left scanning. New acceptable syntactic patterns can easily be incorporated into the generative grammar. It is envisaged to use in the future existing programs for automatically generating analysis programs from any specific variant of the grammar.</Paragraph> <Paragraph position="26"> The present list of acceptable syntactic patterns includes such patterns where noun phrases are concatenated by the preposition 'OF' and the conjunctions 'AND', 'OR', 'AND/OR', as well as constructions of the type 'NP1, NP2, ... AND NPi'. Since no prepositions other than 'OF' and no conjunctions other than 'AND', 'OR', 'AND/OR' can occur in the acceptable segments, the occurrences of other prepositions and conjunctions are used for initial delimitation of acceptable segments, but the detection procedure is not limited to such usage. In particular, a past participle or a group containing adverbs followed by a past participle is excluded from the acceptable segment when preceding an initial delimiter. The segment detection is illustrated for three titles in figure 5.</Paragraph> <Paragraph position="27"> Figure 5. The detection of acceptable segments is shown for 3 titles. The words with all lowercase letters are prepositions and conjunctions used as initial delimiters. The words with only initial capital letters are &quot;semi-stop&quot; words, excluded from being used as index headings; the &quot;semi-stops&quot; underscored by dotted lines are past participles which become delimiters only when followed by initial delimiters. 
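The delimiter-based part of the segment detection described above can be approximated by a short Python sketch. It is a simplification under stated assumptions, not the grammar-driven procedure of the paper: the three word lists are hypothetical miniatures of the system's dictionary, adverb groups before a participle are not handled, and the function name detect_segments is invented for illustration.

```python
# Hypothetical miniature dictionaries; the system described uses about 1,500 entries.
DELIMITERS = {"by", "for", "from", "in", "on", "to", "with", "using", "but"}
ARTICLES = {"a", "an", "the"}
PAST_PARTICIPLES = {"applied", "based", "determined", "induced"}


def detect_segments(title):
    """Split a title into candidate key-word phrases (lists of words).

    The title is scanned right to left. Prepositions other than 'of' and
    conjunctions other than 'and'/'or' act as initial delimiters; a past
    participle standing directly before a delimiter is left out.
    """
    segments = []
    current = []
    before_delimiter = False  # True while everything to the right is a delimiter
    for word in reversed(title.split()):
        cleaned = word.strip(",.;:")
        key = cleaned.lower()
        if key in DELIMITERS:
            if current:
                segments.append(current[::-1])
            current = []
            before_delimiter = True
            continue
        if before_delimiter and key in PAST_PARTICIPLES:
            continue  # participle directly preceding an initial delimiter
        before_delimiter = False
        if key not in ARTICLES:
            current.append(cleaned)
    if current:
        segments.append(current[::-1])
    return segments[::-1]  # restore left-to-right order of the segments


if __name__ == "__main__":
    for phrase in detect_segments("Lead determined in lemon juice by atomic absorption spectrometry"):
        print(" ".join(phrase))
```

Because 'of', 'and' and 'or' are not in the delimiter list, phrases such as 'detection of word phrases' stay together as one segment.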
The resulting multi-word phrases are underscored twice, unlike the resulting single word phrases, which are underscored once.</Paragraph> <Paragraph position="28"> The first part of the system's dictionary contains conjunctions, prepositions, articles, auxiliary verbs and pronouns. This part is completely domain independent. A second part of the dictionary consists of nouns, adjectives, verbs, present and past participles, all of them of little informative value and, therefore, called &quot;semi-stop&quot; words. Such words will not be allowed later to become SI headings. The semi-stop part of the dictionary is somewhat domain-dependent and has to be tuned for different broad fields of knowledge such as science and technology, social sciences or arts and humanities.</Paragraph> <Paragraph position="29"> The second logical step in the SI compilation involves the transformation of acceptable segments into index records consisting of an informative word (not found in the system's dictionary) displayed as heading line and of an index line providing some relevant context for the heading word. Each multi-word segment generates as many index records as it contains informative words. The right-hand side of the segment following the heading word is placed at the beginning of the index line to serve as its immediate context and is followed, after a semicolon, by the segment's left-hand side. When both sides are non-empty, an articulation of the index line is thus achieved. In the case of a single word segment an &quot;expansion&quot; procedure is performed during index record generation. 
It starts by placing at the beginning of the index line a fragment of the title consisting of the filler portion following the heading word and of the next acceptable segment, if any; this initial portion of the index line is followed by a semicolon, after which follows the preceding acceptable segment, followed finally by the filler portion separating in the title this preceding segment from the heading word. The index record generation is illustrated in figure 6.</Paragraph> <Paragraph position="30"> A final &quot;enrichment&quot; phase of the index record generation involves the additional display (in parentheses) of the unused segments of the processed title.</Paragraph> <Paragraph position="31"> The transformation of Key-Word-Phrases into subject headings and subject entries is illustrated for the first two segments of the title A, Figure 6. The last two examples show how single word segments (from Title C) are expanded to include the segments preceding and following them.</Paragraph> <Paragraph position="32"> As a result of this stage the informational value of the finally generated index record is almost equivalent to the information content of the initial full title. The entire process ultimately boils down to the reshuffling of some component segments of the initial title. The enrichment stage of index record generation is illustrated in figure 7.</Paragraph> <Paragraph position="33"> The index records are alphabetized first by heading words and second by index lines, with the exclusion from alphabetization of prepositions and conjunctions if they occur at the beginning of index lines. During the photocomposition different parts of the index line are set using different fonts. If in the original title the initial part of the index line follows the head word immediately, this part is set in bold face italics, i.e. in the same font as the heading. The &quot;inverted&quot; part following the semicolon is set in light face roman letters. 
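The generation and alphabetization of index records for a multi-word segment, as described above, can be sketched in Python. The sketch covers only the basic inversion (right-hand side, semicolon, left-hand side); the expansion of single word segments and the enrichment phase are not modelled. The word list NON_HEADINGS, the handling of an empty side and the function names are assumptions made for illustration.

```python
# Hypothetical dictionary words that may occur inside a segment but never become headings.
NON_HEADINGS = {"of", "and", "or", "and/or"}


def index_records(segment):
    """Return (heading, index line) pairs for one multi-word segment."""
    records = []
    for i, word in enumerate(segment):
        if word.lower() in NON_HEADINGS:
            continue
        right = " ".join(segment[i + 1:])
        left = " ".join(segment[:i])
        # The right-hand context comes first; the left-hand side is 'inverted'.
        # Assumption: an empty side is simply omitted together with the semicolon.
        line = "; ".join(part for part in (right, left) if part)
        records.append((word.upper(), line))
    return records


def sort_key(record):
    """Sort by heading, then by index line ignoring leading 'of', 'and', 'or'."""
    heading, line = record
    words = line.lower().split()
    while words and words[0] in NON_HEADINGS:
        words.pop(0)
    return (heading, " ".join(words))


if __name__ == "__main__":
    records = index_records(["detection", "of", "word", "phrases"])
    for heading, line in sorted(records, key=sort_key):
        print(heading, "|", line)
```

In a fuller version the sort key would also skip the other prepositions and conjunctions that can open an expanded index line.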
Finally the enrichment part of the index line, included in parentheses, is always displayed in light-face italics. As a result the immediately relevant context of the head word is displayed in bold face in order to facilitate its rapid grasping when browsing. Figure 7. The enrichment of the subject entries by the display (in parentheses) of the segments of the same title not used by them, illustrated for some of the entries of Figure 6.</Paragraph> <Paragraph position="34"> Details of the appearance and structure of KWPSI are exemplified in figure 8 on a sample compiled for titles of publications dealing with librarianship and information science. The general appearance of KWPSI is close enough to the appearance of SI of traditional type.</Paragraph> <Paragraph position="35"> For purposes of transportability the KWPSI system is programmed in ANSI COBOL. It includes two modules: the index generation module and the sorting and reformatting module. On an IBM 370 system index records are generated for titles of scholarly papers at a speed of approximately 70,000 titles/hour. The resulting total size of the index is of the same order as the size of KWOC indexes and compares favorably with the size of the PSI index.</Paragraph> <Paragraph position="36"> The analysis of the rates and nature of failures of the segment detection algorithm shows that in 96% of cases the generated segments are fully acceptable as valuable index entries. In 2% of cases some important information is lost as a result of the elimination of prepositions, as in the case of expressions of the 'wood to wood' type. The rest of the failures result in somewhat awkward segments which are not completely semantically self-contained. Even in such cases the index entries retain some informative value. Around half of the failures can be eliminated by additions to the system's dictionary, especially by the inclusion of more verbs and past participles. 
Not counted as failures are the 5% of cases when the length of the detected segments is excessive; such segments can include the whole title.</Paragraph> <Paragraph position="37"> The extent of tuning required for the application of the system in a new area of knowledge depends mainly upon the extent of deviations from the normal structure of natural language texts occurring in the new file.</Paragraph> <Paragraph position="38"> Figure 8. A photocomposed KWPSI sample showing details of its structure and appearance.</Paragraph> <Paragraph position="39"> As a matter of fact all kinds of scholarly titles contain such deviations, as for instance portions of normal text included in parentheses or occurrences of mathematical or chemical symbols. We found only one case when the required tuning effort was significant, namely the case of titles from the domain of arts and humanities. ISI's &quot;Arts and Humanities Citation Index&quot; includes, besides titles of articles, also titles of book reviews, as well as descriptions of musical performances and musical records, compiled according to special rules. Many contain multiword names of works of art, taken in quotation marks, which have to be handled as single words. A KWPSI sample for arts and humanities is shown in figure 9. Two more KWPSI samples are given in figure 10 (science and technology titles) and figure 11 (geoscience research front names).</Paragraph> <Paragraph position="40"> </Paragraph> <Paragraph position="41"> 
The common method of online access to commercially available textual databases of both bibliographic and full text type is through boolean queries formulated in terms of single words.</Paragraph> <Paragraph position="42"> Automatically detectable word phrases of the type used in the KWPSI system could be used in three different ways for improving online access.</Paragraph> <Paragraph position="43"> One extreme way would involve the creation of word phrases of the above type at the input stage for every informative word of the input. In response to a single word query a sequence of screens would be shown displaying the image of what in a printed KWPSI would be the KWPSI section under the given word taken as heading. After browsing online some part of this up-to-date online SI the user could choose to limit further browsing by responding with an additional search term, most likely chosen from some of the already examined index entries. As a result the system would reply by eliminating from the displayed output the entries not containing the given word, and the user would continue to browse the so trimmed display. Several such iterations could be performed until the user is left with the display of an SI to relevant items of the database. This SI would then be printed together with the full list of relevant items. It is thought that this kind of interaction could be more user friendly than the currently used boolean mode.</Paragraph> <Paragraph position="44"> Figure 11. KWPSI sample for names of geoscience research fronts.</Paragraph> <Paragraph position="45"> Another way of using the KWPSI technique in an online environment would be to use the KWPSI format for the output of the results of a retrieval performed in a traditional boolean way. 
The query word which achieved the strongest trimming effect would be used as heading.</Paragraph> <Paragraph position="46"> A third way would involve the compression of a KWPSI section under a given heading before it is displayed in response to a word. One could, e.g., retain only such noun phrases containing the given word which occur at least k times in the database.</Paragraph> <Paragraph position="47"> An example of such a list for the &quot;Arts and Humanities&quot; database is given in figure 12. By displaying such lists of words closely co-occurring with a given one the system would perform thesaurus-type functions. The implementation of all such possibilities would be rather difficult for any existing system in view of the effort required to reprocess past input. Instead, after the input of a query word the corresponding full text records could be called and processed online in core for generating KWPSI-type index records. In this case all the above functions could still be performed.</Paragraph> <Paragraph position="48"> Figure 12. Noun phrases containing the word 'FILM(S)' and appearing at least twice in titles covered during a three month period by ISI's ARTS and HUMANITIES CITATION INDEX. Such automatically compiled lists are suggested as search aids for the online access to databases.</Paragraph> <Paragraph position="49"> Another possibility which we are considering is to place the KWPSI-type processing capabilities into a microcomputer which is being used to mediate online searches in remote databases. All the text records containing a given (not too frequent) word could be initially tapped from the database into the microcomputer. Following that the microcomputer could perform all the above functions in an offline mode.</Paragraph> </Section></Paper>