<?xml version="1.0" standalone="yes"?>
<Paper uid="C69-7602">
  <Title>For more information on this system, write to</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
LANGUAGE
</SectionTitle>
    <Paragraph position="0"> While I had previously written a context-free generator in assembly language, these programs were written in SNOBOL3, which is intended specifically for string processing. There were both advantages and disadvantages inherent in this choice. The language provides simple, powerful operations for parsing strings and allows easy definition of push-down stacks and lists. In addition, a primitive was available which recognizes strings balanced with respect to parentheses. Because of this, I chose to represent trees as fully parenthesized strings. More important than this is the fact that SNOBOL manages all storage automatically; thus the program has almost no pre-defined limits. The major disadvantage of my choice was that I was completely inexperienced in the language and unfamiliar with the recursive techniques permitted. Thus the program was extraordinarily inefficient.</Paragraph>
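The representation itself is not shown in the paper; as a hedged illustration (in Python rather than SNOBOL3, with an invented example sentence), a fully parenthesized string of the kind described can be turned back into a tree by a small recursive parser, much as SNOBOL's balanced-string primitive would isolate subtrees:

```python
def parse_tree(s):
    """Parse a fully parenthesized string such as "(S (NP he) (VP ran))"
    into nested lists: ['S', ['NP', 'he'], ['VP', 'ran']].
    A sketch only; the author's actual node labels and format are unknown."""
    # Tokenize: separate parentheses from labels, then split on whitespace.
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())   # recurse on a balanced substring
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return children

    return node()
```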
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
THE PROGRAM
</SectionTitle>
    <Paragraph position="0"> Both the context-sensitive generator and the transformations program were written in two separate steps, the first converting the rules from a form as similar as possible to that used by the linguist to a form convenient for storing on the machine. In general, all rules containing abbreviatory devices (braces and parentheses) were expanded to a number of sub-rules. These were then punched out and used as input to the generator itself. The size of the programs is  In addition, each program contained approximately one comment for each 4 statements, since this was the only way I could understand what I really intended to write. These were written as the program was written and proved invaluable. I should note that I generally over-document my programs as I tend to borrow algorithms from them years later.</Paragraph>
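The expansion of abbreviatory parentheses (optional elements) into sub-rules can be sketched as follows; the tuple notation for marking an optional symbol is an invention for this example, not the author's input format:

```python
from itertools import product

def expand_rule(symbols):
    """Expand a rule whose optional elements are marked ('?', sym) into
    the full set of sub-rules, one per combination of present/absent
    optional elements -- a sketch of the conversion step described above."""
    choices = []
    for sym in symbols:
        if isinstance(sym, tuple) and sym[0] == "?":
            choices.append(("", sym[1]))   # optional: absent or present
        else:
            choices.append((sym,))          # obligatory: always present
    # Cartesian product over the choices, dropping the empty placeholders.
    return [[s for s in combo if s] for combo in product(*choices)]
```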
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
DEVELOPMENT TIMES
</SectionTitle>
    <Paragraph position="0"> The CS program took about six weeks to get running, while the transformations program took the better part of four months before it worked well enough that I could attempt to transform a &amp;quot;real&amp;quot; sentence. Due to personal reasons, I was unable to debug the grammar/program well enough to consider distribution. The cost of processing a real tree was also prohibitive (20 minutes of 7094 time). I again note that this was caused more by my inexperience with SNOBOL than by any faults inherent in the language. The work could hardly be done so quickly in assembly language or FORTRAN, as I first would have to write a large set of subroutines for string handling, input-output, etc.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
INFLUENCE OF THE LANGUAGE
</SectionTitle>
    <Paragraph position="0"> The strongest external influence on the program was the fact that data must be punched on cards. Thus a two-dimensional notation, as used by Fromkin and Rice (preprint 53) for example, seemed too difficult to program to warrant the effort. Trees and rules thus must be written in a linear manner, using parentheses for structure identification.</Paragraph>
    <Paragraph position="1"> (Though the programmer may indent items when punching.) Any other limitations were primarily caused by my inexperience. Note especially that there is no limitation on the number of characters in a string.</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
CONVERSION
</SectionTitle>
    <Paragraph position="0"> Until I found out about Friedman's program (preprint 14) I had considered rewriting the transformations program in SNOBOL4 -- a string processing language similar to, but incompatible with, its predecessor. It seems, however, that only the algorithm (preserved in the commentary) could be transferred, as SNOBOL4's increased capabilities allowed a much more efficient approach to tree parsing.</Paragraph>
  </Section>
  <Section position="9" start_page="0" end_page="20" type="metho">
    <SectionTitle>
OPERATING SYSTEMS
</SectionTitle>
    <Paragraph position="0"> While it would be very nice to be able to generate sentences in a time-sharing environment, I feel that the languages currently available, and especially the amount of work that must go into interfacing programs with operating systems, preclude any such effort at the present time. One solution is to have full control over a small computer; however, this may excessively limit the size of the program. While there doesn't seem to be any clear-cut answer, I seem to be reluctantly choosing the power and nit-picking of the large operating system so as not to limit the programs. I hope somebody convinces me that I am wrong.</Paragraph>
    <Paragraph position="2"> It may be too late for you to include this in your summary, but now it's written I send it anyhow.</Paragraph>
    <Paragraph position="3"> The first step in the investigation reported in my paper, CA 3.3, is a program for sorting a text into words and word delimiters, and thus qualifies as a data processing program. In my view, this program presents the features you are interested in to a higher degree than the following programs, which are datamatically much simpler, involving only manipulations of the numerical codes for words and other symbols determined by the first program. Linguistically, of course, the later programs contain all the essentials.</Paragraph>
    <Paragraph position="4"> The datamat available to me at Copenhagen University is a GIER (Danish make) with a central store of 4096 40-bit cells and a peripheral store of ab. 300 000 cells. Programming is done in Algol with extensions which make the single bits of each word easily accessible. The datamat has no operator, but is available to the personnel of several institutes, which undoubtedly contributes to more frequent technical breakdowns than comparable operator-managed datamats have.</Paragraph>
    <Paragraph position="5"> The text was obtained on 6-position paper tape without parity check, and it took some ingenuity to convert it to the usual 8-position tape. All the same, it is much cheaper to get the text on these printing-machine produced tapes than to code them anew. The conversion was made not to flexowriter code, but with letters coded in alphabetic order (a=1, b=2, etc.) and other symbols with higher values. A word is defined as a sequence of letters at most interrupted by a lower case symbol after the first; and each word found in the text is first stored in an array (of maximum length 60). When a non-letter symbol is found, the word is then converted to storage format: 5 letters are placed in the first 25 bits of a cell; the next two bits indicate whether the word has no more than 5 characters, and if not, whether this is the first part of the word or a later part. 12 bits are left empty if it is the first part of the word; else the next letters are stored, and one bit indicates whether it is the end of the word or not.</Paragraph>
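As reconstructed from the description above, the first cell of a word holds a sign bit left empty, 25 bits of letters, the two flag bits, and the 12-bit space later used as an alphabetic link. A minimal sketch of such packing (the exact bit positions are a guess, not taken from the letter):

```python
def pack_first_cell(word, more, link=0):
    """Pack up to 5 letters (a=1 .. z=26, 5 bits each) into the top 25
    bits of a 40-bit cell.  Assumed layout, high to low: [bit 39: empty
    sign bit][bits 38-14: letters][bit 13: 'more letters follow' flag]
    [bit 12: spare][bits 11-0: link to next word in alphabetic order]."""
    assert len(word) <= 5
    cell = 0
    for i, ch in enumerate(word):
        code = ord(ch) - ord("a") + 1          # a=1, b=2, ...
        cell |= code << (39 - 5 * (i + 1))     # left-justified below sign bit
    cell |= (1 if more else 0) << 13           # continuation flag
    cell |= link & 0xFFF                       # 12-bit alphabetic link
    return cell
```

Because letters are left-justified and coded in alphabetic order, comparing cells as integers reproduces alphabetic order, which is exactly the property the next paragraph relies on.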
    <Paragraph position="6"> Now comes the dictionary look-up. With a word stored as described, the alphabetic ordering coincides with a numerical ordering when cells are interpreted as integers (the bit which in integer mode indicates sign is always left empty, i.e. as +). The dictionary is stored in an array of length 3090 to allow room for other variables, including the word array of length 60 mentioned above; it is numbered from 201 upwards for reasons explained in the paper. Each new word found is stored under the first vacant number, and every occurrence of it is indicated by this number in the output.</Paragraph>
    <Paragraph position="7"> The alphabetic ordering is taken care of by list processing: the vacant 12 bits in the first part of a word are used to store the number of the next word in alphabetic order. To avoid having to go through the whole dictionary, an index is kept of initials, indicating the number of the first word with each initial. (Some reduction of search time could undoubtedly be obtained if the initials were sub-divided by the value of the next letter.) The output of the program consists of the dictionary, number and letter sequence for each word, ordered either by numbers or alphabetically, and the processed text string, words given by their number above 200, other symbols by their number below 100, depending on the value of the last case symbol. (A space which only separates two words is suppressed; other possibilities of reduction do not present nearly the same reduction of space requirement.)</Paragraph>
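The look-up scheme can be sketched as follows, with the simplification (mine, not the letter's) that the alphabetic chain is kept per initial rather than threaded through the whole dictionary with an index pointing into it:

```python
class Dictionary:
    """Sketch of the look-up described above: each new word gets the
    first vacant number (201 upwards); alphabetic order is kept as a
    linked list of word numbers, entered via an index of initials."""

    def __init__(self):
        self.words = {}      # number -> word
        self.next_num = 201  # first vacant number
        self.order = {}      # number -> number of alphabetic successor
        self.first = {}      # initial letter -> number of first word

    def lookup(self, word):
        # Follow the chain for this initial until we pass the word.
        num = self.first.get(word[0])
        prev = None
        while num is not None and self.words[num] < word:
            prev, num = num, self.order.get(num)
        if num is not None and self.words[num] == word:
            return num                      # already in the dictionary
        # Not found: store under the first vacant number, splice in.
        new = self.next_num
        self.next_num += 1
        self.words[new] = word
        self.order[new] = num
        if prev is None:
            self.first[word[0]] = new
        else:
            self.order[prev] = new
        return new
```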
    <Paragraph position="9"> In this way, the central store will hold a dictionary of ab.</Paragraph>
    <Paragraph position="10"> 2500-3000 words (words up to 5 letters take one cell, with 6-12 letters they take 2 cells, with 13-19 letters 3 cells, etc.), which in a unitary text will hold all but the very infrequent words.</Paragraph>
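The quoted cell counts imply that the first cell holds 5 letters and each continuation cell 7 more (5 + 7 = 12, 5 + 14 = 19); under that assumption the storage cost per word is:

```python
def cells_needed(n_letters):
    """Cells per word, assuming the first cell holds 5 letters and each
    continuation cell 7 more -- an inference from the 1/2/3-cell ranges
    (up to 5, 6-12, 13-19 letters) quoted above."""
    if n_letters <= 5:
        return 1
    return 1 + -(-(n_letters - 5) // 7)   # 1 + ceil((n - 5) / 7)
```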
    <Paragraph position="11"> The program builds heavily on the type of datamat used; only the most general principles will be transferable. Some slight alterations have been necessary to enable the program to be run on another datamat of the same make which is operator-run (but still totally without time-sharing or similar devices). I cannot to any degree of accuracy assess the initial time used for programming, which was not excessive, or that for debugging, which was considerable. Both parts of the work were done over a long period in between other work.</Paragraph>
    <Section position="1" start_page="0" end_page="20" type="sub_section">
      <SectionTitle>
Gustav Leunbach
</SectionTitle>
      <Paragraph position="0"> Answers to questionnaire #5 Type of Project: Modelling of Linguistic System Language: Assembly language, IBM 360/65, 500K (h.s.) The computer used works in a time sharing set up, in which we are one of a number of users; no conversational methods are used.</Paragraph>
      <Paragraph position="1"> Choice of Language: Owing to the fact that 50% of the central core storage is permanently occupied by the time sharing system, the space available to our program is only 250 K; this is barely sufficient; hence, in order to compress the job as much as possible, the entire application was programmed in assembly language. The formulation of the problem and the program, therefore, are in many ways influenced or, rather, determined by the characteristics of the particular machine.</Paragraph>
      <Paragraph position="2"> Development time: The flow charting, defining of algorithms, and linguistic research had all been done previously, in three years' work, for another machine (GE 425); the time required to remodel the entire application for use on the IBM 360 was approximately 3 months. - The linguistic approach and the algorithmic formulations it requires and makes possible are highly unorthodox and, therefore, not at all suitable for formulation in an existing high-level language.</Paragraph>
      <Paragraph position="3"> The system works without vocabulary look-up; the sentences to be analysed are input on punched cards; such storage of words as occurs during the analysis procedure is achieved by numeric code.</Paragraph>
      <Paragraph position="4"> There are no character manipulation subroutines; input and output definition is in IOCS. No external storage is used.</Paragraph>
      <Paragraph position="5"> The program is thoroughly defined by the sequence of operations determined by the linguistic procedure. Since the application is exclusively experimental, there is continual exchange and modification of both program and algorithms; and the program was written, from the outset, with this in mind, i.e.</Paragraph>
      <Paragraph position="6"> allowing for easy alteration in many areas.</Paragraph>
      <Paragraph position="7"> Notes comprehensible only to the programmer who devised the program.</Paragraph>
      <Paragraph position="8">  Size of Program: 3200 instructions, no commentary. The model 360 we are using disposes of approximately 180 machine instructions; the Multistore program employs no more than 30 of these (of which about 6 or 8 could be reduced to others, so that the total of used instructions could be brought down to approximately 22, or 12% of the instructions available in the machine).</Paragraph>
      <Paragraph position="9"> This is a typical symptom of the situation of linguistic, artificial intelligence, artificial perception, etc., programming in general: the machines actually available are far too complicated, i.e. they can do innumerable things which are not needed in that kind of program; on the other hand, machines specially designed for these tasks would have to have larger central cores.</Paragraph>
      <Paragraph position="10"> No doubt processing times could be greatly shortened on special purpose machines.</Paragraph>
      <Paragraph position="11"> Time sharing: Yes. If we had a console in our office, it certainly would save time.</Paragraph>
      <Paragraph position="12"> Job Control: 5ince the computer has to be used by other people and for other tasks as well, one has to accept job control; if we had a computer exclusively for this particular use of ours, we should do away with job control.</Paragraph>
      <Paragraph position="13"> Change of Machine: Since the program was to some extent determined by the particular machine (capacity, byte configuration, etc.) we are using, it is not transferable to another type. Being purely experimental, this was not an objective.</Paragraph>
      <Paragraph position="14"> Language: Yes. The requirements being so very specific (see above) programming in a machine oriented language is essential.</Paragraph>
      <Paragraph position="15"> Teaching: No.</Paragraph>
      <Paragraph position="16"> Linguist Programmers: No. The analytical work to be done to understand the workings of natural language is still so enormous that they should not scatter their attention and efforts; they should, however, have fairly clear ideas about what can and what cannot be implemented on a computer and, above all, how minutely all formulations of linguistic rules have to be defined, if they are to work satisfactorily on a computer.</Paragraph>
      <Paragraph position="17">  In reply to some of your ideas expressed in &amp;quot;Metaprint&amp;quot;, I am sending a brief history of the phonological testing program (see preprint #53). The program has been through several translations, and parts of it have actually run on two machines while other parts have not yet been coded.</Paragraph>
      <Paragraph position="18"> The project began about a year ago, when a fairly simple program was written in Super Basic on the Tymshare, Inc. timesharing system. That program accepted a single test form, placing it in a binary matrix of feature values. Rules were written directly in Super Basic coding, performing the desired operations on the bit matrix. Later in the school year we decided to try to set up a similar system on the IBM 360/91 on campus, and the Super Basic program was rewritten in PL/I. This program was still simply a rule executor, and the rules had to be coded in PL/I. Difficulties with the IBM system led to the abandonment of this project. There were two main causes here which lead into the current system called PHONOR. First, I was completely turned off by the IBM system performance (91 means 91% down time). The more important reason, however, is that I wanted more flexibility in the scope display than the primitive batch job system allowed. During the time the program was being rewritten in PL/I, I was thinking more and more about a better system of rule specification and input than coding in a standard computer language, which is not very suitable for a linguistic researcher. Some early thinking about the string matching process and a gradually improving knowledge of Chomsky and Halle's SPE led to a rule compiler algorithm which accepted a string of text stating the rule, using a notation quite similar to the SPE format, and produced as output a list of matching process operations. I soon realized that these matching operations could be coded and stored fairly compactly as they were produced by the compiler and then read by a separate rule interpreter system which contained the test matrix and performed the matching operations in the order in which the compiler had stored them. This led to the present system written for the LINC-8 in our lab.</Paragraph>
      <Paragraph position="19"> Input to the compiler will be either from the teletype or from a specified file on disk or mag tape. The input rules may be displayed on the scope in a two-dimensional format very close to the SPE formalism. This input may be edited, compiled or saved in a file. When the interpreter is loaded (by a single command to the compiler system) the most recently compiled set of rules is loaded. Operation of the interpreter is under complete interactive control of the linguistic researcher at the console, who may enter test forms, specify which rules to apply and set or reset flags for various printout options as the interpreter runs. The compiler may also be recalled at any time.</Paragraph>
      <Paragraph position="20"> I have not added substantially to the basic compiler algorithm since writing the conference paper. I have worked out a subroutine generation system to take care of the case mentioned in the last paragraph. Actually most of the coding in the compiler is (will be) concerned with more mundane housekeeping tasks such as input text manipulation and setting up storage for the coded output. As the program nears completion, I will definitely have clearer documentation of its structure and capabilities. I tend to avoid this as most programmers do unless I can get it done while I'm in the mood of blowing my horn (as now). Then it flows out pretty well. I hope to remain responsive to suggestions as the program is used and desire to make it available as widely as possible.</Paragraph>
      <Paragraph position="21">  The heart of this system is a rule expression language consisting of operations to be performed on the string of phonological units stored in the test matrix. These operations are described in the paper using the PL/I language and comprise push-down stack operations, unit match instructions, matrix modification instructions and various forms of branch instructions. The system actually consists of two parts: I) A compiler, which reads the rules as they are entered and translates them to the rule expression language, and II) An interpreter, which contains the test matrix, accepts a test string from the console and interprets the rule expression language, modifying the test matrix as indicated by the rule coding. PHONOR is now being written for the Digital Equipment Corp. LINC-8 with two Linctape units and 8K of core memory. The interpreter is written in PDP-8 machine language and is now completed. The compiler is being written in LINC LAP-6 assembly language and will be running sometime in October, 1969. One memory field (4K) is dedicated to storage of the rule expression coding when the interpreter is running. I expect to get 30 to 40 average sized rules in the memory field. Additional fields of rules may be stored on Linctape and read in under program control. The present system has an upper limit of 128 rules.</Paragraph>
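The compile-then-interpret division can be sketched as follows. This is a loose illustration in Python, not PHONOR's actual operation set: the op names are invented, and features are represented as plain dicts rather than a bit matrix:

```python
def compile_rule(target, change, left, right):
    """Compile a rule 'target -> change / left _ right' into a flat list
    of match/modify operations, loosely in the spirit of a rule
    expression language.  Each feature bundle is a dict of feature -> value."""
    ops = [("MATCH", seg) for seg in left]       # left context
    ops.append(("MATCH_TARGET", target))         # segment to be changed
    ops += [("MATCH", seg) for seg in right]     # right context
    ops.append(("MODIFY", change))               # features to rewrite
    return ops

def interpret(ops, units):
    """Run the compiled ops over a list of feature dicts (the 'test
    matrix'), applying the change wherever the whole pattern matches."""
    width = sum(1 for op, _ in ops if op.startswith("MATCH"))
    for start in range(len(units) - width + 1):
        pos, tgt, ok = start, None, True
        for op, arg in ops:
            if op == "MODIFY":
                continue
            if all(units[pos].get(f) == v for f, v in arg.items()):
                if op == "MATCH_TARGET":
                    tgt = pos
                pos += 1
            else:
                ok = False
                break
        if ok:
            units[tgt].update(ops[-1][1])   # apply the change in place
```

For example, a devoicing rule [+voice, -son] -> [-voice] / [-voice] _ compiles to a three-op list and is applied by scanning the unit string once.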
      <Paragraph position="22"> One item described in the paper which the present system will not support is the notation &amp;quot;X&amp;quot;, meaning any string of units not containing the boundary symbol &amp;quot;#&amp;quot;. This would require a more complex matching algorithm than I have yet worked out. If it appears that such a notational device is useful it will be considered as a future extension. I hope to be able to include in the near future the capability of handling indexed disjunctions (angle brackets in Chomsky and Halle, SPE). This brings up a number of questions relating to disjunctively ordered rules (as in SPE) and the exact sequence of matching units within a rule. PHONOR treats disjunctions somewhat differently than the system of SPE in that computational efficiency is given priority over descriptive efficiency. I think it is a shortcoming of the current ideas on descriptive simplicity in a grammar that dynamic computational simplicity is not taken into account. It is my hope that future use of the PHONOR system will help in setting up new models for overall operational simplicity in the phonological component.</Paragraph>
      <Paragraph position="23"> For more information on this system, write to  (preprint number 4) I refer to your metaprint entitled &amp;quot;Computerized Linguistics&amp;quot;. For your information I should like to answer the questions which you raise in so far as they apply to the SMART document retrieval system:  1. The SMART system is information retrieval oriented.</Paragraph>
      <Paragraph position="24"> 2. The system is programmed for a batch processing computer (IBM 360 model 65) largely in Fortran IV, with some of the inner routines and executive programs in assembly language.</Paragraph>
      <Paragraph position="25"> 3. The choice of language was determined by the programming systems available with our computer and the preferences of the programmers.</Paragraph>
      <Paragraph position="26"> 4. The planning, flowcharting, and programming took approximately three years from 1961 to 1964, and a total of approximately 10 man years.</Paragraph>
      <Paragraph position="27"> 5. The total number of programming steps (assembly language instructions) is approximately 150,000.</Paragraph>
      <Paragraph position="28"> 6. The program is not easily transferable onto another machine.</Paragraph>
      <Paragraph position="29"> 7. For many years I have been teaching a graduate course entitled &amp;quot;Automatic Information Organization and Retrieval&amp;quot; in which linguistic analysis procedures are used.</Paragraph>
      <Paragraph position="30"> I should be glad to participate in the panel session if it is held within the first couple of days of the Conference (since I must leave early). I shall be glad to amplify on the comments given above.  The program is used both for actual processing and for testing linguistic models.</Paragraph>
      <Paragraph position="31"> A complete program is running on an IBM 7044 computer (32K memory) and a new version is being written for the IBM 360-67.</Paragraph>
      <Paragraph position="32"> II- LANGUAGE Programs for the 7044 were written in MAP (macro assembly language). The program consists of eight steps, along with a supervisor embedded in the IBSYS (IBJOB) system which interfaces the different programs with each other and with input-output devices.</Paragraph>
      <Paragraph position="33"> This is, of course, a batch-processing system.</Paragraph>
      <Paragraph position="34"> In the new program, the most important algorithms, which have to be very efficient, will be written in assembler language. Auxiliary programs will be written in PL/I. This program must run both under conversational mode (using the CP/CMS system) and batch-processing mode. Conversational mode will be used for debugging and for testing linguistic models, while batch-processing will only be used for production.</Paragraph>
      <Paragraph position="35"> The language choice never influences the problem definition. III- STRUCTURE of the program The program is composed of eight different steps, each roughly corresponding to a particular linguistic model.</Paragraph>
      <Paragraph position="36">  4- the output text.</Paragraph>
      <Paragraph position="37"> The last three are encoded to preserve program efficiency. Grammars, for example, may be pre-compiled by a special subroutine.</Paragraph>
      <Paragraph position="38"> It is also necessary to provide auxiliary programs, giving input, output, and if necessary, intermediary results a human-readable form.</Paragraph>
      <Paragraph position="39"> Thus, we need to write two different types of programs. The processor -- which must be very efficient and is usually quite short -- is written in assembler language. The auxiliary programs -- which need not be particularly efficient, but must be easily modifiable -- are written in a problem-oriented language, PL/I. The latter represent 60% of the programming work (including compilers, text file updating, dictionaries, etc.)</Paragraph>
    </Section>
  </Section>
  <Section position="10" start_page="20" end_page="23" type="metho">
    <SectionTitle>
IV - TIME REQUIRED
</SectionTitle>
    <Paragraph position="0"> This depends on the nature of the step. In the case of syntactical analysis, probably the most important, the following roughly holds: statement of problem: about two or three years; defining data structures and system programming: six months; programming and debugging the algorithm: one year; programming and debugging auxiliary programs: one year; computer time for program debugging: ten hours (7044). The complete 7044 program, including all eight steps, contains about 65 000 machine instructions, 20 000 for the program, 45 000 for auxiliary routines.</Paragraph>
    <Paragraph position="2"> After the 7044 program was debugged, we began changing to the 360-67. We are trying to convert all algorithms directly. The most important changes are relative to data management. We had many problems with tape devices for the files and feel that the direct-access capabilities of the newer machine will prove very useful.</Paragraph>
    <Paragraph position="3"> In writing the first program, we were very cautious about program efficiency. While this is, of course, important, it did become very time consuming for the linguistic debugging (of grammars) and dictionary updating. This was partly due to batch-processing. With the new computer, we shall always use conversational mode for debugging. The program thus must be executable in both conversational and batch modes. The most important problem is to make the files compatible under both systems.</Paragraph>
    <Paragraph position="4"> PL/I seems to give us all the power we need, but we intend to limit its use to auxiliary programs.</Paragraph>
    <Paragraph position="5"> I think it is important to speak a little about artificial languages for linguistics. We were obliged to define special languages for this purpose.</Paragraph>
    <Paragraph position="6"> In some cases, we wrote a compiler; while in others, such as tree transformation, we used a sophisticated macro processor. Macro assembly is very attractive -- the operations being easy to define, describe, and modify. In our case, language defining and macro writing took only three months. Unfortunately, macro assembly is very slow and, in the case of the 360, not sufficiently powerful. We were thus obliged to write our own compiler, instead of using the IBM software directly.</Paragraph>
  </Section>
</Paper>