<?xml version="1.0" standalone="yes"?>
<Paper uid="C69-5101">
  <Title>A SEARCH ALGORITHM AND DATA STRUCTURE FOR AN EFFICIENT INFORMATION SYSTEM</Title>
  <Section position="2" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> This paper describes a system for information storage, retrieval, and updating, with special attention to the search algorithm and data structure demanded for maximum program efficiency. The program efficiency is especially warrantedwhen a natural language or a symbolic language is involved in the searching process.</Paragraph>
    <Paragraph position="1"> The system is a basic framework for an efficient information system.</Paragraph>
    <Paragraph position="2"> It can be implemented for text processing and document retrieval; numerical data retrieval; and for handling of large files such as dictionaries, catalogs, and personnel records, as well as graphic ~ informations. Currently, eight cor~nands are implementedand operational in batch mode on a CDC 3600: STORE, RETRIEVE, ADD, DELETE, REPLACE, PRINT, C(R4PP, ESS and LIST. Further development will be on the use of teletype console, CRT terminal, and plotter under a time-sharing environment for producing innnediate responses.</Paragraph>
    <Paragraph position="3"> The maximum program efficiency is obtained through a unique search algorithm and data structure, instead of examining the recall ratio and the precision ratio at a higher level, this efficiency is measured in the most basic term of &amp;quot;average number of searches&amp;quot; required for looking ~p an item. In order to identify an item, at least one search is necessary even if it is found the first time.</Paragraph>
    <Paragraph position="4"> Hc~ever, through the use of the hash-address of a key or keyword, in conjunction with an indirect-chaining list-structured table, and a large available space list, the average number of searches required for retrieving a certain item is 1.25 regardless of the size of the file in question. This is to be compared with 15.6 searches for the binary search technique in a 50,O00-item file, and 5.8 searches for the letter-table method with no regard to file size.</Paragraph>
    <Paragraph position="5"> *This study was supported in part by the National Science Foundation and the University of Wisconsin.</Paragraph>
    <Paragraph position="6"> Best of all, since the program can use the same technique for storing and updating informations, the maximum efficiency is also applicable to them wlth the same ease. Thus, it eliminates all the problems of inefficiency caused in establishing a file, and in updating a file.</Paragraph>
    <Paragraph position="7"> I. MOTIVATION In our daily life, there are too many instances of looking for some type of information such as checking a new vocabulary in a dictionary, finding a telephone number and/or an address in a directory, searching a book of a certain author, title, or subject in a library catalog card file, etc, Before the desired information is found, one has to go through a number of items or entries for close examination. The quantitative measurement is usually termed as the &amp;quot;number of searches&amp;quot;, &amp;quot;number of lookups&amp;quot;, or &amp;quot;number of file accesses&amp;quot; in mechanized information systems.</Paragraph>
    <Paragraph position="8"> HoWever, as King pointed out in his article in the Annual Review of Information Science and Technology, volmne 3, (pp.74-75) that the most cOmmon measures of accuracy of an information system are the recall ratio and precision ratio. These two measures have come under considerable criticism for their indifference in retrieval characteristics, being misleading and producing varying results. They probably should be used primarily to highlight a system's unsatisfactory perf~nce. From the failure analysis of Hooper, King, Lancaster and others, the reasons are: incorrect query formulation, indexing errors, mechanical errors, incorrect screening, etc.</Paragraph>
    <Paragraph position="9"> In the same volume (p. 139), Shoffner cc~nented on the evaluation of system, s that &amp;quot;it is important to be able to determine the extent to which file structures and search techniques influence the recall, precision, and other measures of syste~ performance&amp;quot;. Not until very recently, file structure and search techniques were apparently i unpopular topics among information scientists except Salton and a few others. Nevertheless, these topics have been attacked constantly by system scientists for a much smaller size of file but the maximt~ efficiency is a vital factor for the total system. They are frequently discussed under the title of &amp;quot;symbol table techniques&amp;quot;, or &amp;quot;scatter storage techniques&amp;quot; as used by Morris as the title of his article. In addition to the &amp;quot;number of searches&amp;quot; and the &amp;quot;number of lookups&amp;quot; other terminologies used by the syste~ scientist for referencing the most basic measure are the &amp;quot;number of probes&amp;quot;~ the &amp;quot;number of attempts&amp;quot;, and the &amp;quot;search length&amp;quot;.</Paragraph>
    <Paragraph position="10"> Ever since 1964 the author stepping into the cemputer profession noticed that the efficiency of a file handling system is always crippled by its file searching technique no matter how sophisticated the system. This was especially the case during 1965 and 1966 when the author was employed at the Itek Corporation on an Air Force project of a Chinese to English machine translation experiment. The best search technique used for dictionary lookups was the binary search which is still considered one of the best techniques available today.</Paragraph>
    <Paragraph position="11"> For a large file with a huge number of records, entries or items, the binary search technique will still yield a substantial number of searches which is a function of the file size. The typical files are: dictionaries of any sort, telephone directories, library catalog cards, personnel records, merchandise catalogs, doct~ment collections, etc. For example, in a 50,000-entry file system the average number of searches for finding an entry is 15.6 calculated as log2N. This figure will not be very satisfactory if frequent search inquiries to a file are the case. As a result to finding better search techniques, at least three kinds of search techniques or algorithms are found to be more satisfactory than the binary search.</Paragraph>
    <Paragraph position="12"> Namely they are: lamb and Jacobson's &amp;quot;Letter Table Method&amp;quot;, Peterson's &amp;quot;Open-Addressing Technique&amp;quot;, and Johnson's &amp;quot;Indirect Chaining Method&amp;quot;. They have a rather interesting c~on feature that the file size is no longer a factor in the search efficiency.</Paragraph>
    <Paragraph position="13">  -3-IIo EFFICIENCY OF VARIOUS SEARCH AIEORITHMS In order to have a gross understanding of various search algorithms, six of them are examined and compared in respect to their search effieieneies.</Paragraph>
    <Paragraph position="14"> i. Linear Search This is also called sequential search or sequential scan.</Paragraph>
    <Paragraph position="15"> The linear search of an unordered list or file is the simplest one, but is inefficient because the average number of searches for a given entry in a N-entry file will be N/2. For example, if N = 50,000, the average number of searches for a given entry is an enormous 25,000. It is assumed that the probability of finding a given entry in the file is one. The average number of searches in a linear search is calculated as: S N + 1 or S = _~N 2 2 if N is a large numberdeg The linear search has to be performed in a consecutive storage area and this sOmetimes causes certain inconvenience if the required storage area is very large. The inconvenience can be avoided by using the last cOmputer word (or some bits of it) to index the location of the next section of sto~age area used and thus form a single chain for searching. This variation of the linear search method is called the single chain method. It differs from the linear search in storage flexibility but is otherwise the same in the efficiency.</Paragraph>
  </Section>
class="xml-element"></Paper>