METAPRINT 3
(METAMETAPRINT I)
Responses to "COMPUTERIZED LINGUISTICS: HALF A COMMENTARY"
- Martin Minow -
Rather than attempt a summary of the replies to "metaprint" 1
included here, I feel it would be more useful for me to discuss one of my 
programs. 
PURPOSE 
The program generates sentences from a generative (context-sensitive, 
transformational) grammar. It is almost identical in function to that pre- 
sented by Joyce Friedman in preprint 14. 
LANGUAGE
While I had previously written a context-free generator in assembly 
language, these programs were written in SNOBOL3, which is intended 
specifically for string processing. There were both advantages and dis- 
advantages inherent in this choice. The language provides simple, power- 
ful operations for parsing strings and allows easy definition of push-
down stacks and lists. In addition, a primitive was available which re-
cognizes strings balanced with respect to parentheses. Because of this,
I chose to represent trees as fully parenthesized strings. What is more
important than this is the fact that SNOBOL manages all storage auto-
matically. Thus the program has almost no pre-defined limits.
The major disadvantage of my choice was that I was completely in-
experienced in the language and unfamiliar with the recursive techniques
permitted. Thus the program was extraordinarily inefficient.
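SNOBOL3's balanced-string primitive has no direct counterpart in most modern languages, but the fully parenthesized tree representation itself is easy to reproduce. A minimal sketch in Python (the node labels and bracketing conventions here are my own assumptions, not Minow's actual format):

```python
def parse_tree(s, pos=0):
    """Parse a fully parenthesized string such as
    (S (NP (N john)) (VP (V runs))) into nested lists."""
    assert s[pos] == "("
    pos += 1
    node = []
    while s[pos] != ")":
        if s[pos] == "(":
            child, pos = parse_tree(s, pos)
            node.append(child)
        elif s[pos].isspace():
            pos += 1
        else:  # a label: read up to the next delimiter
            end = pos
            while s[end] not in "() \t":
                end += 1
            node.append(s[pos:end])
            pos = end
    return node, pos + 1

tree, _ = parse_tree("(S (NP (N john)) (VP (V runs)))")
# tree == ['S', ['NP', ['N', 'john']], ['VP', ['V', 'runs']]]
```

Because the string is fully parenthesized, a single recursive scan suffices; no grammar-specific knowledge is needed to recover the structure.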
THE PROGRAM 
Both the context-sensitive generator and the transformations program
were written as two separate steps, the first converting the rules from a
form as similar to that used by the linguist as possible to a form con- 
venient for storing on the machine. In general, all rules containing 
abbreviatory devices (braces and parentheses) were expanded to a number 
of sub-rules. These were then punched out and used as input to the gene- 
rator itself. The size of the programs is:

    CF generator (assembly)        2500 cards
    CS rule reader                  300 statements
    CS generator                   ~360
    Transformations rule reader    ~280
    Transformations executor       ~600
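The expansion of abbreviatory devices described above can be sketched briefly. Purely for illustration, I assume parentheses mark optional elements, so a rule with n of them expands to 2^n sub-rules (braces, listing alternatives, would expand analogously into one sub-rule per alternative):

```python
from itertools import product
import re

def expand_optional(rule):
    """Expand a rule like 'NP -> (DET) (ADJ) N' into the sub-rules
    obtained by including or omitting each parenthesized element."""
    parts = re.split(r"(\([^)]*\))", rule)
    choices = []
    for p in parts:
        if p.startswith("("):
            choices.append([p[1:-1], ""])   # with and without the element
        else:
            choices.append([p])
    rules = {" ".join("".join(c).split()) for c in product(*choices)}
    return sorted(rules)

print(expand_optional("NP -> (DET) (ADJ) N"))
# ['NP -> ADJ N', 'NP -> DET ADJ N', 'NP -> DET N', 'NP -> N']
```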
In addition, each program contained approximately one comment for
each 4 statements, since this was the only way I could understand what
I really intended to write. These were written as the program was written
and proved invaluable. I should note that I generally over-document my
programs as I tend to borrow algorithms from them years later.
DEVELOPMENT TIMES 
The CS program took about six weeks to get running, while the trans-
formations program took the better part of four months before it worked
well enough that I could attempt to transform a "real" sentence. Due to
personal reasons, I was unable to debug the grammar/program well
enough to consider distribution. The cost of processing a real tree
was also prohibitive (20 minutes of 7094 time). I again note that this
was caused more by my inexperience with SNOBOL than by any faults 
inherent in the language. The work could hardly be done so quickly in 
assembly language or FORTRAN as I first would have to write a large 
set of subroutines for string handling, input-output, etc.
INFLUENCE OF THE LANGUAGE 
The strongest external influence on the program was the fact that
data must be punched on cards. Thus a two dimensional notation, as 
used by Fromkin and Rice (preprint 53) for example, seemed too diffi- 
cult to program to warrant the effort. Trees and rules thus must be 
written in a linear manner, using parentheses for structure identification. 
(Though the programmer may indent items when punching.) Any other
limitations were primarily caused by my inexperience. Note especially 
that there is no limitation on the number of characters in a string. 
CONVERSION 
Until I found out about Friedman's program (preprint 14), I had considered
rewriting the transformations program in SNOBOL4 -- a string processing
language similar to, but incompatible with, its predecessor. It seems,
however, that only the algorithm (preserved in the commentary) could be 
transferred, as SNOBOL4's increased capabilities allowed a much more
efficient approach to tree parsing. 
OPERATING SYSTEMS 
While it would be very nice to be able to generate sentences in a time- 
sharing environment, I feel that the languages currently available, and
especially the amount of work that must go into interfacing programs with 
operating systems preclude any such effort at the present time. One 
solution is to have full control over a small computer; however, this
may excessively limit the size of the program. While there doesn't
seem to be any clear-cut answer, I seem to be reluctantly choosing 
the power and nit-picking of the large operating system so as not to limit 
the programs. I hope somebody convinces me that I am wrong. 
Copenhagen, 28th August 1969.

It may be too late for you to include this in your summary, but
now it's written I send it anyhow.
The first step in the investigation reported in my paper, CA 3.3,
is a program for sorting a text into words and word delimiters, and
thus qualifies as a data processing program. In my view, this pro-
gram presents the features you are interested in to a higher degree
than the following programs, which are datamatically much simpler,
involving only manipulations of the numerical codes for words and
other symbols as determined by the first program. Linguistically, of
course, the later programs contain all the essentials.
The datamat available to me at Copenhagen University is a GIER
(Danish make) with a central store of 4096 40-bit cells and periphe-
ral store of ab. 300 000 cells. Programming is done in Algol with
extensions which make the single bits of each word easily accessible.
The datamat has no operator, but is available to the personnel of
several institutes, which undoubtedly contributes to more frequent
technical breakdowns than comparable operator-managed datamats have.
The text was obtained on 6-position paper tape without parity
check, and it took some ingenuity to convert it to the usual 8-pos-
ition tape. All the same it is much cheaper to get the text on these
printing-machine produced tapes than to code them anew.
The conversion was made not to flexowriter code, but with letters
coded in alphabetical order (a=1, b=2 etc.) and other symbols with
values above those of the letters. A word is defined as a sequence
of letters at most interrupted by a lower case symbol after the first;
and each word found in the text is first stored in an array (of length
60, to be on the safe side). When a non-letter symbol is found the
word is then converted to storage form: 5 letters are placed in the
first 25 bits of a cell; the next two bits indicate whether the word
has no more than 5 letters, and if not, whether this is the first part
of the word or a later one. 12 bits are left empty if it is the first
part of the word, else the next two letters are stored, and one bit
indicates whether it is the end of the word or not.
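The packing scheme can be sketched in Python; the exact bit layout below (letter codes a=1 upwards, 5 bits per letter in the high bits, flag bits at the low end) is my reconstruction from the description, not the actual GIER code:

```python
def pack_first_cell(word):
    """Pack up to 5 letters (a=1, b=2, ...) into the top 25 bits of a
    cell (sign bit left clear); shorter words are padded with 0."""
    assert word.isalpha() and word.islower()
    cell = 0
    for i in range(5):
        code = ord(word[i]) - ord("a") + 1 if i < len(word) else 0
        cell = (cell << 5) | code
    overflow = 1 if len(word) > 5 else 0   # "more than 5 letters" flag
    return (cell << 2) | overflow          # 2 low bits for the flags

# Because the letters sit in the high bits, alphabetic order
# coincides with numerical order of the packed cells:
assert pack_first_cell("cat") < pack_first_cell("dog")
assert pack_first_cell("cat") < pack_first_cell("cats")
```

This is what makes the dictionary look-up below work by plain integer comparison.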
Now comes the dictionary look-up. With a word stored as describ-
ed, the alphabetic ordering coincides with a numerical ordering when
cells are interpreted as integers (the bit which in integer mode ind-
icates sign is always left empty, i.e. as +). The dictionary is stor-
ed in an array of length 3890 to allow room for other variables in-
cluding the word array of length 60 mentioned above; it is numbered
201-4090 for reasons explained in the paper. Each new word found is
stored under the first vacant number, and every occurrence of it is
indicated by this number in the output.
The alphabetic ordering is taken care of by list processing: the
vacant 12 bits in the first part of a word are used to store the num-
ber of the next word in alphabetic order. To avoid having to go
through the whole dictionary an index is kept of initials indicating
the number of the first word with each initial. (Some reduction of
search time could undoubtedly be obtained if the initials were sub-
divided by the value of the next letter.)
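This list-processing scheme amounts to insertion into a singly linked list, entered through an index of initials. A sketch (dictionary numbering and field layout are simplified; the real program keeps the link in the 12 vacant bits of the word's first cell):

```python
class Dictionary:
    """Words stored in arrival order; alphabetic order kept as a
    linked list of indices, entered via an index of initials."""
    def __init__(self):
        self.words = []          # arrival order, as in the GIER array
        self.next_link = []      # index of next word in alphabetic order
        self.initial_index = {}  # initial letter -> first word index

    def insert(self, word):
        self.words.append(word)
        self.next_link.append(None)
        new = len(self.words) - 1
        head = self.initial_index.get(word[0])
        if head is None or self.words[head] >= word:
            self.next_link[new] = head
            self.initial_index[word[0]] = new
            return new
        prev = head
        while (self.next_link[prev] is not None
               and self.words[self.next_link[prev]] < word):
            prev = self.next_link[prev]
        self.next_link[new] = self.next_link[prev]
        self.next_link[prev] = new
        return new

    def in_order(self, initial):
        """Yield the words with a given initial, alphabetically."""
        i = self.initial_index.get(initial)
        while i is not None:
            yield self.words[i]
            i = self.next_link[i]

d = Dictionary()
for w in ["den", "dag", "dog"]:
    d.insert(w)
assert list(d.in_order("d")) == ["dag", "den", "dog"]
```

Insertion never moves stored words, only rewrites links, which is why each word can keep its dictionary number for the whole run.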
The output of the program consists of the dictionary, number and
letter sequence for each word, ordered either by numbers or alpha-
betically, and the processed text string, words given by their num-
ber above 200, other symbols by their number below 100, depending on
the value of the last case symbol. (A space which only separates two
words is suppressed; other possibilities of reduction do not present
nearly the same reduction of space requirement.)
In this way, the central store will hold a dictionary of ab.
2500-3000 words (words up to 5 letters take one cell, with 6-12 let-
ters they take 2 cells, 13-19 3 cells etc.) which in a unitary text
will hold all but the very infrequent words. 
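The cell counts just quoted follow from 5 letters in the first cell and, as described earlier, 7 in each continuation cell (5 in the letter field plus 2 in the otherwise vacant 12 bits); as a quick check:

```python
import math

def cells_needed(length):
    """First cell holds 5 letters; each further cell holds 7."""
    return 1 if length <= 5 else 1 + math.ceil((length - 5) / 7)

assert cells_needed(5) == 1
assert cells_needed(6) == 2 and cells_needed(12) == 2
assert cells_needed(13) == 3 and cells_needed(19) == 3
```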
The program builds heavily on the type of datamat used; only the
most general principles will be transferable. Some slight alterat-
ions have been necessary to enable the program to be run on another
datamat of the same make which is operator-run (but still totally
without time-sharing or similar devices). I cannot to any degree of
accuracy assess the initial time used for programming, which was
not excessive, or that for debugging, which was considerable. Both
parts of the work were done over a long period in between other work.
Gustav Leunbach
Answers to questionnaire
Type of Project: Modelling of Linguistic System 
Language: Assembly language, IBM 360/65, 500K (h.s.)
The computer used works in a time sharing set
up, in which we are one of a number of users;
no conversational methods are used.
Choice of Language: Owing to the fact that 50% of the central core
storage is permanently occupied by the time shar-
ing system, the space available to our program
is only 200 to 251 K; this is barely sufficient;
hence, in order to compress the job as much as 
possible, the entire application was programmed 
in assembly language. The formulation of the pro- 
blem and the program, therefore are in many ways 
influenced or, rather, determined by the charact- 
eristics of the particular machine. 
Development time: The flow charting, defining of algorithms, and
linguistic research had all been done previously,
in three years' work, for another machine (GE 425);
the time required to remodel the entire applica- 
tion for use on the IBM 360 was approximately 3 
months. - The linguistic approach and the algo- 
rithmic formulations it requires and makes pos- 
sible are highly unorthodox and, therefore, not 
at all suitable for formulation in an existing 
high-level language. 
The system works without vocabulary look-up; the
sentences to be analysed are input on punched
cards; such storage of words as occurs during the
analysis procedure is achieved by numeric code.
There are no character manipulation subroutines; 
input and output definition is in IOCS. No extern- 
al storage is used. 
The program is thoroughly defined by the sequence 
of operations determined by the linguistic pro- 
cedure. 
Since the application is exclusively experimental,
there is continual exchange and modification of
both program and algorithms; and the program was
written, from the outset, with this in mind, i.e. 
allowing for easy alteration in many areas. 
Notes comprehensible only to the programmer who 
devised the program. 
Size of Program: 3200 instructions, no commentary.
The model 360 we are using disposes of approxi-
mately 180 machine instructions; the Multistore
program employs no more than 30 of these (of
which about 6 or 8 could be reduced to others,
so that the total of used instructions could be
brought down to approximately 22, or 12% of the
instructions available in the machine). 
This is a typical symptom of the situation of 
linguistic, artificial intelligence, artificial 
perception, etc., programming in general: the 
machines actually available are far too compli- 
cated, i.e. they can do innumerable things which 
are not needed in that kind of program; on the 
other hand, machines specially designed for these 
tasks would have to have larger central cores. 
No doubt processing times could be greatly short- 
ened on special purpose machines. 
Time sharing: Yes. If we had a console in our office, it cer- 
tainly would save time. 
Job Control: Since the computer has to be used by other people
and for other tasks as well, one has to accept 
job control; if we had a computer exclusively for 
this particular use of ours, we should do away 
with job control. 
Change of Machine: Since the program was to some extent determined 
by the particular machine (capacity, byte confi- 
guration, etc.) we are using, it is not transfer- 
able to another type. Being purely experimental, 
this was not an objective.
Language: Yes. The requirements being so very specific (see 
above) programming in a machine oriented language 
is essential. 
Teaching: No. 
Linguist Programmers: No. The analytical work to be done to under- 
stand the workings of natural language is still 
so enormous that they should not scatter their 
attention and efforts; they should, however, have 
fairly clear ideas about what can and what cannot 
be implemented on a computer and, above all, how 
minutely all formulations of linguistic rules have 
to be defined, if they are to work satisfactorily 
on a computer. 
Lloyd Rice 
1929 12th Street - Apt. A 
Santa Monica, California 90404 
August 26, 1969 
(preprint number 53) 
In reply to some of your ideas expressed in "Metaprint", I
am sending a brief history of the phonological testing program
(see preprint #53). The program has been through several
translations and parts of it have actually run on two machines
while other parts have not yet been coded.
The project began about a year ago, when a fairly simple program
was written in Super Basic on the Tymshare, Inc. timesharing
system. That program accepted a single test form, placing it in
a binary matrix of feature values. Rules were written directly
in Super Basic coding, performing the desired operations on the
bit matrix. Later in the school year we decided to try to set
up a similar system on the IBM 360/91 on campus and the Super
Basic program was rewritten in PL/I. This program was still
simply a rule executor and the rules had to be coded in PL/I.
Difficulties with the IBM system led to the abandonment of this
project. There were two main causes here which lead into the
current system called PHONOR. First, I was completely turned
off by the IBM system performance (91 means 91% down time). The
more important reason, however, is that I wanted more flexibility
in the scope display than the primitive batch job system allowed.
During the time the program was being rewritten in PL/I, I was
thinking more and more about a better system of rule specification
and input than coding in a standard computer language, which is
not very suitable for a linguistic researcher to use. Some
early thinking about the string matching process and a gradually
improving knowledge of Chomsky and Halle's SPE led to a rule
compiler algorithm which accepted a string of text stating the
rule using a notation quite similar to the SPE format and produced
as output a list of matching process operations. I soon realized
that these matching operations could be coded and stored fairly
compactly as they were produced by the compiler and then read by a
separate rule interpreter system which contained the test matrix
and performed the matching operations in the order in which the 
compiler had stored them. This led to the present system written 
for the LINC-8 in our lab. 
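This compile-then-interpret split can be sketched abstractly. The operation names and the toy rule syntax below are invented for illustration; the real PHONOR coding is a list of matching-process operations over a binary feature matrix, not character strings:

```python
def compile_rule(rule):
    """Compile a toy rewrite rule 'a -> b / c _ d' (change a to b
    between c and d) into a flat list of matching operations."""
    target, rest = rule.split("->")
    change, env = rest.split("/")
    left, right = env.split("_")
    return [("match", left.strip()),
            ("match", target.strip()),
            ("match", right.strip()),
            ("replace", 1, change.strip())]  # replace 2nd matched unit

def interpret(ops, units):
    """Scan the unit string and apply the compiled operations."""
    units = list(units)
    pattern = [op[1] for op in ops if op[0] == "match"]
    for i in range(len(units) - len(pattern) + 1):
        if units[i:i + len(pattern)] == pattern:
            for op in ops:
                if op[0] == "replace":
                    units[i + op[1]] = op[2]
    return "".join(units)

ops = compile_rule("t -> d / a _ a")
assert interpret(ops, "atta") == "atta"   # environment not met
assert interpret(ops, "ataa") == "adaa"   # t between two a's becomes d
```

The point of the split is that the compiled operation list is compact enough to store and reload, so the interpreter never needs to see the rule notation itself.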
Input to the compiler will be either from the teletype or from a
specified file on disk or mag tape. The input rules may be
displayed on the scope in a two-dimensional format very close to
the SPE formalism. This input may be edited, compiled or saved
in a file. When the interpreter is loaded (by a single command
to the compiler system) the most recently compiled set of rules
is loaded. Operation of the interpreter is under complete inter-
active control of the linguistic researcher at the console, who
may enter test forms, specify which rules to apply and set or
reset flags for various printout options as the interpreter runs.
The compiler may also be recalled at any time.
I have not added substantially to the basic compiler algorithm 
since writing the conference paper. I have worked out a subroutine 
generation system to take care of the case mentioned in the last 
paragraph. Actually most of the coding in the compiler is (will 
be) concerned with more mundane housekeeping tasks such as input 
text manipulation and setting up storage for the coded output.
As the program nears completion, I will definitely have clearer 
documentation of its structure and capabilities. I tend to avoid 
this as most programmers do unless I can get it done while I'm in
the mood of blowing my horn (as now). Then it flows out pretty well. 
I hope to remain responsive to suggestions as the program is used 
and desire to make it available as widely as possible. 
Sincerely yours,
DLR:ksc
PHONOR
The Interactive Phonological Rule System
The heart of this system is a rule expression language consisting of operations
to be performed on the string of phonological units stored in the test matrix.
These operations are described in the paper using the PL/I language and comprise
push-down stack operations, unit match instructions, matrix modification instr-
uctions and various forms of branch instructions. The system actually consists of
two parts: I) A compiler, which reads the rules as they are entered and translates
them to the rule expression language, and II) An interpreter, which contains the
test matrix, accepts a test string from the console and interprets the rule expression
language, modifying the test matrix as indicated by the rule coding.
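The test matrix and a matrix-modification instruction can be sketched as follows; the feature set and values here are illustrative, not Rice's actual encoding:

```python
# A test string is held as a binary matrix: one column per unit,
# one row per phonological feature (illustrative feature set).
FEATURES = ["voiced", "nasal", "continuant"]

def make_matrix(units, feature_table):
    """Build the test matrix for a string of units."""
    return [[feature_table[u][f] for u in units] for f in FEATURES]

# Hypothetical feature values for a few units.
TABLE = {"p": {"voiced": 0, "nasal": 0, "continuant": 0},
         "b": {"voiced": 1, "nasal": 0, "continuant": 0},
         "m": {"voiced": 1, "nasal": 1, "continuant": 0}}

def modify(matrix, col, feature, value):
    """A matrix-modification instruction: set one feature of one unit."""
    matrix[FEATURES.index(feature)][col] = value

m = make_matrix("pm", TABLE)
modify(m, 0, "voiced", 1)       # voice the first unit: p becomes b-like
assert m[FEATURES.index("voiced")] == [1, 1]
```

Unit-match instructions would test columns of such a matrix against a rule's feature specification, and branch instructions would control which modification runs next.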
PHONOR is now being written for the Digital Equipment Corp. LINC-8 with two
Linctape units and 8K of core memory. The interpreter is written in PDP-8 machine
language and is now completed. The compiler is being written in LINC LAP-6 assembly
language and will be running sometime in October, 1969. One memory field (4K) is
dedicated to storage of the rule expression coding when the interpreter is running.
I expect to get 30 to 40 average sized rules in the memory field. Additional fields
of rules may be stored on Linctape and read in under program control. The present
system has an upper limit of 128 rules.
One item described in the paper which the present system will not support
is the notation "X", meaning any string of units not containing the boundary symbol
"#". This would require a more complex matching algorithm than I have yet worked
out. If it appears that such a notational device is useful it will be considered
as a future extension. I hope to be able to include in the near future the capa-
bility of handling indexed disjunctions (angle brackets in Chomsky and Halle, SPE).
This brings up a number of questions relating to disjunctively ordered rules (as in
SPE) and the exact sequence of matching units within a rule. PHONOR treats
disjunctions somewhat differently than the system of SPE in that computational
efficiency is given priority over descriptive efficiency. I think it is a short-
coming of the current ideas on descriptive simplicity in a grammar that dynamic
computational simplicity is not taken into account. It is my hope that future use
of the PHONOR system will help in setting up new models for overall operational
simplicity in the phonological component.
For more information on this system, write to 
D. Lloyd Rice 
Phonetics Lab, Humanities Bldg. 1110
UCLA
Los Angeles, Ca. 90024
UPSON HALL
DEPARTMENT OF COMPUTER SCIENCE
CORNELL UNIVERSITY 
ITHACA, N.Y. 14850 
Telephone (607) 
25~4117 
August 18, 1969 
(preprint number 4) 
I refer to your metaprint entitled "Computerized Linguistics". 
For your information I should like to answer the questions which you 
raise in so far as they apply to the SMART document retrieval system: 
1. The SMART system is information retrieval oriented.
2. The system is programmed for a batch processing computer
(IBM 360 model 65) largely in Fortran IV, with some of the
inner routines and executive programs in assembly language.
3. The choice of language was determined by the programming 
systems available with our computer and the preferences 
of the programmers. 
4. The planning, flowcharting, and programming took approxi-
mately three years from 1961 to 1964, and a total of
approximately 10 man years.
5. The total number of programming steps (assembly language
instructions) is approximately 150,000.
6. The program is not easily transferable onto another
machine. 
7. For many years I have been teaching a graduate course 
entitled "Automatic Information Organization and Retrie- 
val" in which linguistic analysis procedures are used.
I should be glad to participate in the panel session if it is
held within the first couple of days of the Conference (since I must 
leave early). I shall be glad to amplify on the comments given above. 
Sincerely, 
Gerard Salton 
Professor
Computer Science 
GS:zl
G. Veillon
I- PROJECT Mechanical translation 
The program is used both for actual processing and for testing 
linguistic models. 
A complete program is running on an IBM 7044 computer 
(32K memory) and a new version is being written for the
IBM 360-67. 
II- LANGUAGE 
Programs for the 7044 were written in MAP
(macro assembly language). The program consists of
eight steps, along with a supervisor embedded in the 
IBSYS (IBJOB) system which interfaces the different 
programs with each other and with input-output devices. 
This is, of course, a batch-processing system. 
In the new program, the most important algorithms, 
which have to be very efficient, will be written in 
assembler language. Auxiliary programs will be 
written in PL/I. This program must run both under
conversational mode (using the CP/CMS system) and
batch-processing mode. Conversational mode will 
be used for debugging and for testing linguistic models, 
while batch-processing will only be used for production. 
The language choice never influences the problem definition.
III- STRUCTURE of the program 
The program is composed of eight different steps, each 
roughly corresponding to a particular linguistic model. 
These are: 
1 - pre-editing
2 - dictionary look-up
3 - morphological analysis
4 - syntactical analysis
5 - tree transformations (intermediate language)
6 - syntactical generation
7 - morphological generation
8 - post-editing
Each step requires: 
1 - the program itself
2 - the input text (which is the output of
the previous step) 
3 - linguistic parameters: grammar and lexicography 
4 - the output text.
The last three are encoded to preserve program efficiency. 
Grammars, for example, may be pre-compiled by a 
special subroutine. 
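The step structure just described (each step reading the previous step's output together with its own linguistic parameters) can be sketched as a plain pipeline; the function bodies here are placeholders, purely for illustration:

```python
def run_pipeline(text, parameters, steps):
    """Feed each step's output into the next, with per-step
    linguistic parameters (grammar, lexicon)."""
    for step in steps:
        text = step(text, parameters.get(step.__name__, {}))
    return text

# Placeholder steps; the real versions would transform the text.
def pre_editing(text, params):        return text
def dictionary_lookup(text, params):  return text
# ... morphological analysis, syntactical analysis, tree
# transformations, syntactical generation, morphological
# generation and post-editing follow the same signature.

result = run_pipeline("source text", {}, [pre_editing, dictionary_lookup])
assert result == "source text"
```

The uniform signature is what lets a supervisor chain the eight steps and route each step's output, as the IBSYS supervisor does in the 7044 version.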
It is also necessary to provide auxiliary programs, giving
input, output, and, if necessary, intermediary results a
human-readable form.
Thus, we need to write two different types of programs.
The processor -- which must be very efficient and is usually
quite short -- is written in assembler language. The
auxiliary programs -- which need not be particularly
efficient, but must be easily modifiable -- are written
in a problem-oriented language, PL/I. The latter
represent 60% of the programming work (including 
compilers, text file updating, dictionaries, etc.) 
IV - TIME REQUIRED 
This depends on the nature of the step. In the case of 
syntactical analysis, probably the most important, 
the following roughly holds: 
statement of problem: about two or three years
defining data structures and system programming: 
six months 
programming and debugging the algorithm: one year 
programming and debugging auxiliary programs: 
one year 
computer time for program debugging: 
ten hours (7044) 
The complete 7044 program, including all eight steps, 
contains about 65 000 machine instructions, 20 000 for
the program, 45 000 for auxiliary routines.
V - PROGRAM CONVERSION 
After the 7044 programs were debugged, we began
changing to the 360-67. We are trying to convert
all algorithms directly. The most important changes are
relative to data management. We had many problems
with tape devices for the files and feel that the direct-access 
capabilities of the newer machine will prove very useful. 
In writing the first program, we were very cautious about 
program efficiency. While this is, of course, important, 
it did become very time consuming for the linguistic debugging 
(of grammars) and dictionary updating. This was partly 
due to batch-processing. With the new computer, we shall 
always use conversational mode for debugging. The program
thus must be executable in both conversational and batch modes. 
The most important problem is to make the files comparable 
under both systems. 
PL/I seems to give us all the power we need, but we intend
to limit its use to auxiliary programs.
I think it is important to speak a little about artificial 
languages for linguistics. We were obliged to define 
special languages for this purpose. 
In some cases, we wrote a compiler; while in others, 
such as tree transformation, we used a sophisticated 
macro processor. Macro assembly is very attractive -- 
the operations being easy to define, describe, and modify. 
In our case, language defining and macro .writing took 
only three months. Unfortunately, macro assembly is 
very slow and, in the case of the 360, not sufficiently 
powerful. We were thus obliged to write our own compiler, 
instead of using the IBM software directly. 
