JANUSZ STANISŁAW BIEŃ
TOWARDS COMPUTER SYSTEMS FOR CONVERSING 
IN POLISH 
1. PHILOSOPHY OF THE MARYSIA PROJECT 
1.1. Natural language communication trends in computer software. 
The progress in computer hardware in recent years has been enormous.
Computers are now extremely fast and relatively cheap, and the
capacity of their storage has also been multiplied. These factors influence
both the range of computer applications and the complexity of soft- 
ware. Computers are now used directly not only by mathematicians, 
physicists and data processing departments, but also by scientific work- 
ers of almost all domains of knowledge (including philology, philo- 
sophy, archaeology, etc.), managers and even sometimes laymen such 
as patients in hospitals. On the other hand, the great computational 
power of existing hardware allows us to develop very sophisticated 
systems for solving complicated problems, in a fully automatic manner 
or by means of interaction with man. There is no reason to doubt 
this is a steady trend in the computer world. We have to realize now 
that it means that man-machine communication will become more 
and more crucial in computer usage. First, if we cannot make communication
with computers easier, then the growing number of computer
users will cause the total cost of training to rise considerably. Secondly,
even an excellent problem solver can be of no use if we do not develop 
the means for stating a problem correctly. The aim of research in prov- 
ing the correctness of programs and automatic program synthesis is 
to solve the software crisis by making the work of programmers easier.
Acknowledgement. The work described here was done at the mathematical department of Warsaw University by a team consisting of S. Szpakowicz, W. Łukaszewicz
and the author; in the early stage of development it was supervised by Prof. S. Waligórski.
It is not yet clearly realized that any result in the domain may only 
shift the burden from expressing ideas in programming language to 
doing the same but in another formalism. The following should prove 
the above statement. Let us consider the man-machine interaction presented
in the R. W. FLOYD (1971) paper and try to design a formalism
for it. It will become evident that if such a formalism exists, then because 
of its complexity it is not easier to express the ideas in it than just to 
program the problem. 
Our assumption is that the only long-term solution of these problems
is man-machine communication in natural language. It is, of
course, an old idea. It has appeared in the design of COBOL, in the questionnaire
method of inquiry of men by computers and vice versa, and in
some question-answering and information retrieval systems, etc. C.
MEADOW (1970, p. 141) has stated the following: "the lure of natural
language communication is with us, and we may expect to see a con- 
tinuing trend towards its use, or its approximation, in all forms of man- 
machine communication". We strongly believe this, and this is the genesis
of the project aiming at developing the MARYSIA Polish language
conversational system. 
1.2. What does "conversational system" mean.
When the idea of time sharing was brand new, every session with 
any time sharing system was called a "conversation with a computer".
It still happens that we meet the word "conversation" in this meaning, 
but it is better to distinguish interactive systems (and languages) and 
conversational systems. By the latter we mean a system which allows 
interaction in natural language, usually a limited language. This statement
requires some clarification. It can be understood in a broad
sense as follows: every system you can communicate with in
natural language is a conversational system. However, its narrow sense 
is more appealing. Let us consider for the moment the structure of a 
conversational system. I claim that for designing and debugging pur- 
poses such a large system is to be split into some modules with clearly 
established functions of modules and interfaces between them. One 
of the modules is to be a "brain" of the system; it determines the
behavior of the system by controlling its slave modules. As a rule, for 
the purpose of portability and adaptability it should not have contact 
directly with the external environment. One (and often the only) 
method of interaction between the "brain" and the external world 
is to use a special conversational module. The module has as an input 
utterances of a natural language and translates them into a "brain" 
formalism or vice-versa (as proved by T. WINOGRAD, 1971, during the 
analysis of an utterance a feedback from the "brain" is desirable for
efficiency). In most cases the "brain" can be just an already existing
interactive system, in other cases the rest of the system may be espe- 
cially developed for making full utilization of the module possibilities, 
e.g. automatic resolution of ambiguities during text preparation for 
statistical computations or preprocessing of utterances before se- 
mantic and pragmatic analysis in an artificial intelligence system. Such 
a module is fairly complex and relatively independent of other parts 
of the system it is embedded in; therefore we prefer to consider it as 
a separate system and to refer to it as a conversational system. The 
MARYSIA system is conversational in this narrower sense.
1.3. The aim of the MARYSIA project.
We consider the rich inflexion of the Polish language as the main 
difficulty in developing systems for man-machine communication in 
Polish. For example, it makes the questionnaire method inconvenient, 
because for psychological reasons we have to choose between two 
possibilities: either to allow only "yes"-"no" answers or to accept
a considerable number of mismatches, caused by impermissible infle- 
xional forms. It is also not possible to develop any more sophisticated 
language processing system for Polish without implementing (or sim- 
ulating) algorithms of inflexional analysis and synthesis. Therefore 
the primary aim of the MARYSIA project was to break the barrier of 
inflexion. This means solving two problems. First, we had to design 
a formalism to talk about inflexion with sufficient precision. Secondly, 
we had to develop a general purpose system of practical use, with the 
ability to perform inflexional analysis and synthesis. These attributes 
of a system seem to be contradictory, but we found a way out. We 
have split the system into two parts with different functions. One part 
of it has to cover the morphological level of the language. This is an 
open ended part of the system, because there are only two restrictions 
on its adequacy: one is the computer storage available and the second 
is the necessity to describe the morphology by means of the notions 
designed by us for this purpose. It is important that the adequacy is 
not fixed at the moment of system generation but can be increased 
step by step, mainly by putting new items into the MARYSIA dictionaries.
The second part of the system has to serve temporarily as a
means for "jumping over" the higher levels of language, such as syntax 
and semantics, and eventually pragmatics. At the moment it is rather 
primitive. It has been patterned after J. WEIZENBAUM'S ELIZA systems
(1966), which were interpreters for exchangeable scripts consisting
mainly of decomposition and reassembly rules; the difference is that the rules
of MARYSIA's scripts can refer to morphological descriptions of a word.
The rich inflexion of Polish is here of some help, because many syntactic 
relations and some semantic facts are clearly reflected by morphology, 
and therefore even simple means can cover some parts of syntax and 
semantics. We do not yet know how large a domain of syntax
and semantics is reducible to morphology (and also to MARYSIA's
script rules). To find this out as well as to recognize the practical appli- 
cability of a morphology based conversational system are the secondary 
aims of the MARYSIA project. 
2. MARYSIA'S LINGUISTIC PROBLEMS
2.1. General assumptions. 
Automatic text processing of any kind forces us to face many lin- 
guistic problems of great importance. If a working system is required 
as a result of a project, then as a rule it is impossible to spend much 
time on working out solutions to all problems; in most cases we take 
for granted, sometimes even unconsciously, existing opinions. It is my 
feeling that we should not take for granted all our linguistic back- 
ground, because almost every project can verify or reject some linguis- 
tic statements, e.g. the work on a frequency dictionary can clarify 
some problems of homonymy, etc. The main theoretical point of the 
MARYSIA project was the concept of "word".
There are many definitions of "word" in the linguistic literature.
Why do we not want to use any of them? The reason is that all of 
them (at least all I know) are of no use when we want to decide whe- 
ther a given object is a word or not. Such a situation is fairly common 
in linguistics; let us take for example the well known definition of
"morpheme": "the minimal meaningful unit of an utterance", and
try to check that a given text is an utterance, that a given unit is mean- 
ingful and that it is minimal. After all, we do not have to accept the 
situation. How to avoid it then? It is necessary to find a basis for linguistic
research, serving as the only criterion for evaluating the theories.
If we look for such a basis, we realize more than ever that language
has no clear boundary in any aspect. There is no border between
languages in space and time, there is no border between using language 
and other types of behavior (e.g. between understanding utterances 
and reasoning based on knowledge of reality); the opinions concerning 
perception of speech and handwritten letters have recently changed very 
much, so the concept of spoken or written utterance is no less vague 
than the "meaning". What way out is there? Let us draw attention
to the fact that a printed or typed text is quite different from any other 
kind of utterance, because it is in fact a string of characters from a 
finite, well defined alphabet. A new page, a change of type font, etc. 
can be considered as special letters in the alphabet, as is the case in com- 
puter composition systems. Therefore we can decide that every well 
printed text is equivalent to a computer-readable text of any form 
(paper or magnetic tape, text prepared for OCR readers, etc.). In the
present state of the art the computer readable text is, in my opinion, 
the only basis for all linguistic research. In other words, if we consider 
Babbage as the father of computers, it is Gutenberg who is the father 
of linguistics (at least computational linguistics). 
A unit which can be defined strictly on the basis of a computer
readable text is a "word", i.e. a string of characters between two delimiters,
e.g. punctuation marks. Such words are of different kinds; they
can constitute numbers, abbreviations, mathematical formulae, etc. We 
will consider now only those words which are composed exclusively 
of letters (or have been substituted by such a word, e.g. the word 5 
in English can be substituted by five). Of course, the division of text 
into words is of little interest for a linguist for two reasons. First, spell- 
ing rules are often rather loose, therefore the same text can be seg- 
mented in different ways, and "a word can have different spellings" 
(I put this in quotation marks because "word" obviously has a different
meaning in this context). Secondly, a word - again because of
spelling rules - is sometimes too long for our purposes. I refer to cases 
when a word is obtained by concatenation of two or more different 
words (which happens in Polish and is very frequent in German for 
example). The second difficulty is more important and we solve it 
first by introducing the notion of a "lex", which is a word or a substring
of a word. Let us distinguish now word-types, word-tokens,
lex-types and lex-tokens. The lexes are defined mainly by enumerating 
their lex-types. For practical purposes this is quite enough, and from 
the theoretical point of view the finite list of lex-types can be supple- 
mented by a device for generating potential lexes from a finite diction- 
ary of morphemes. I would like to stress our point that all lexes of 
practical significance come from a finite (although large) dictionary. 
It is also important that as a rule lexes are quite different from mor- 
phemes; they can be described loosely as "words, which because of 
spelling rules can be sometimes written together".
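The two definitions above can be illustrated by a minimal Python sketch. It is not the MARYSIA implementation: the tiny German lex-type list and the function names are assumptions made only for the example, and the real system establishes lex borders with table-driven automata rather than by exhaustive search.

```python
import re

# Hypothetical finite list of lex-types (illustration only).
LEX_TYPES = {"die", "haus", "tür"}

def words(text):
    """Words composed exclusively of letters: maximal runs of letters
    between delimiters (spaces, punctuation marks, digits)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def split_into_lexes(word, lex_types=LEX_TYPES):
    """Return every way of writing `word` as a concatenation of lex-types."""
    if word == "":
        return [[]]  # one way to segment the empty string: no lexes
    results = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in lex_types:
            for tail in split_into_lexes(word[i:], lex_types):
                results.append([head] + tail)
    return results

print(words("Die Haustür, 5 mal!"))   # digits do not form letter-only words
print(split_into_lexes("haustür"))    # one word, two lexes
```

A word that cannot be covered by the lex-type list yields an empty result, which corresponds to a word not recognized by the dictionary.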
2.2. Hierarchy of linguistic units. 
In this paragraph I present the hierarchy of linguistic units as implemented
in the MARYSIA system, i.e. as it was designed in 1970-1971.
It has in general stood the test of time and the only changes it is subject 
to are of an aesthetic kind. The terminology I use here is consistent 
with English summaries of my papers (J. ST. BIEŃ, 1971; 1972 a; 1972 b).
Lexes exhibit different features, which are not equally relevant to 
us. It is natural then to consider them as variants of higher-level units. 
Therefore we introduce the notion of a "lexeme"; a lexeme-type
consists of an ordered set of its allolexes and a choice function de- 
scribing what allolex is to be used in a given context. All allolexes of 
a lexeme should be fully equivalent from the linguistic point of view 
although they may have different spelling or pronunciation. Examples 
of allolexes in Polish are niego, jego, go (all mean "him"; the first is
used after a preposition, the second at the beginning of an utterance, 
and the third in all other contexts), in German neue and neuer (strong 
and weak declensions of an adjective), in English a and an, etc. The 
choice functions are implemented as a special kind of finite automaton, 
which has as an input the lex-tokens which are in the neighbourhood 
of the lexeme-token under consideration. It follows from this that 
allolexes are never stylistic variants. We think that distinguishing stylistic
differences in texts can be very useful in some applications, e.g.
computer aided language teaching. Therefore every lexeme, apart from 
its strictly grammatical properties, has its stylistic features which I 
call frequency evocation and quality evocation. At present frequency 
evocation can have as a value one of five grades: proper, acceptable, 
rare, wrong, non-existent; the quality evocation is described by means 
of qualifying labels. All lexemes with the same grammatical properties 
fall into one "form". We insist that for every form there is one lexeme
which is the "best" one, i.e. it has the highest frequency evocation.
This is a way to obtain a normative dictionary together with an ade- 
quate enough description of the real vocabulary. 
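The notion of a lexeme as an ordered set of allolexes plus a choice function can be sketched as follows. This is only an illustration under stated assumptions: the choice functions are written as plain Python functions of the neighbouring lex-tokens instead of the finite automata used in MARYSIA, the English rule looks at spelling only, and the small preposition list is a hypothetical sample.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

@dataclass
class Lexeme:
    allolexes: Sequence[str]
    # choice function: (previous lex-token, next lex-token) -> index of allolex
    choose: Callable[[Optional[str], Optional[str]], int]

    def realize(self, prev, nxt):
        return self.allolexes[self.choose(prev, nxt)]

def choose_article(prev, nxt):
    # Crude, spelling-based assumption: "an" before a vowel letter.
    return 1 if nxt and nxt[0].lower() in "aeiou" else 0

INDEFINITE_ARTICLE = Lexeme(("a", "an"), choose_article)

PREPOSITIONS = {"do", "od", "bez", "dla"}  # tiny illustrative sample

def choose_him(prev, nxt):
    if prev is None:           # beginning of an utterance
        return 1               # jego
    if prev in PREPOSITIONS:   # after a preposition
        return 0               # niego
    return 2                   # go, in all other contexts

HIM = Lexeme(("niego", "jego", "go"), choose_him)

print(INDEFINITE_ARTICLE.realize(None, "utterance"))
print(HIM.realize("do", None), HIM.realize(None, "nie"), HIM.realize("widzę", None))
```

Because the choice depends only on the context, the allolexes stay fully equivalent from the grammatical point of view; stylistic variation is left to the evocations.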
Until now we have introduced four linguistic units (word, lex, 
lexeme, form), but none of them is equivalent or even similar to the 
most popular meaning of word (in Polish słowo, wyraz), i.e. word in
the sense of e.g. A. PENTTILÄ (1972). We call such a unit a "formeme";
it is an ordered set of forms. The ordering is necessary, because with 
every position in the set some syntactic features are connected. Now 
the problem is: what syntactic features can be put together into one 
formeme, in other words, how to establish borders between formemes. 
Our answer is that it can be done only arbitrarily by trading off the 
complexity of dictionary entries and the grammar which uses them. 
In the MARYSIA system forms of a formeme can exhibit only features 
of number, case, gender and person; all other features are assigned to 
a formeme as a whole. 
There is also one more notion, the "group". The group was
designed for strictly technical purposes, i.e. for making dictionaries 
more compact by collapsing the descriptions of similar formemes into 
single entries. On account of the lack of a semantic component in the 
MARYSIA system it is used now in a different way. We put some for- 
memes into one group if and only if there are enough regular diffe- 
rences between them from the semantic point of view. For example, 
an adjectival class of groups contains in every entry the positive and 
comparative degrees of the adjective, its adjectival adverb and the 
adjectival noun; the verbal class of groups contains the Present Tense 
and simple forms of the Imperative Mood, the Past Participle, the Pas- 
sive and the adjectival Simultaneous Participle, etc. 
As far as I know, the notions introduced for the MARYSIA system 
have no counterparts in linguistic theories, mainly because MARYSIA
notions account for stylistic variations. Another important difference
is that they are based on text words and therefore they do not describe
phrases (e.g. verb forms which are spelled separately). It may seem 
strange that so many notions have to be introduced to clarify the no- 
tion of word (at least for Polish), but it seems to me that a convenient 
and elegant description of a vocabulary still requires some additional 
notions. 
2.3. Morphological coordinates. 
It should be noted now that of the five notions introduced in the 
preceding paragraph, only one of them refers to an observable and print- 
able object, i.e. the lex (strictly, the lex-token). The problem is then 
how to refer to any concrete object of another type, e.g. a lexeme, a 
group, etc. The solution we use (J. ST. BIEŃ, 1970; 1972 a) is the following.
Every item of a vocabulary possesses its paradigm, i.e. the set of
all lexes which are included (directly or not) in the item. When the 
full paradigm designates the object we can refer to it by enumerating 
lexes of the paradigm; if this is not the case, we have to mark the level 
of the item (e.g. the word pod, meaning "under", can label the prepositional
formeme, the only inflexional form of the formeme, or the
only lexeme of the form, etc.). This method is safe, but rather incon- 
venient when the paradigm of an item is numerous. In this case we 
can use an abbreviated method of reference, i.e. we may describe the 
paradigm instead of enumerating it. In most cases it is enough to give 
only one lex of the paradigm to describe it exactly, but in some situa- 
tions it may be necessary to give two or more of them. It is worth 
noting that any lex (or set of lexes) can be used to label a paradigm, 
although we may prefer the traditional convention of using the Nominative
Singular for nouns, the Infinitive for verbs, etc. For distinguishing
different levels of vocabulary it was suggested in J. ST. BIEŃ (1972)
that we use different type fonts (or underlining and quotation marks
in manuscripts), but the more traditional "labelled bracketing",
e.g. [dom] TM for the given formeme, can also serve for this purpose
very well. 
The above mentioned method is very good for a human, but it 
is inconvenient for internal representation of vocabulary items in computer
programs. Especially for this purpose we have designed "morphological
coordinates". We have noticed that every item can be
referenced by means of giving the address of the biggest vocabulary 
item it is included in, and specifying some of its particular features. 
The features which are used for the purpose in the MARYSIA system
were also influenced by technical considerations. As the result of the
trade-off, MARYSIA's morphological coordinates are the following:
1. The dictionary item address. The item may represent a group 
or a formeme. 
2. The morphological type of the formeme under consideration. 
For formeme items it serves mainly for checking purposes, but for 
groups it describes a subset of all formemes belonging to the given 
group. 
3. Serial number of the formeme in the formeme subset of the 
given morphological type. For formeme items it is equal to zero and 
serves only for checking. For group items it describes together with 
the second coordinate exactly one formeme of the given group. 
4, 5, 6, 7. The values of, respectively, number, case, gender and
person categories. A value equals zero if a category does not concern 
the formeme. 
8. Serial number of the allolex in the given lexeme. 
If all eight coordinates are specified, then as a rule we refer to exactly 
one lex. In some cases there are some stylistic variants of the given 
lex; they have the same morphological coordinates. Then we can 
specify qualifying labels for evocations which are of interest to us; 
if we do not do this, it is assumed we refer to the "best" lex of the 
given form. 
The most important property of the morphological coordinates is 
that they can serve as a convenient tool for handling useful sets of lexes. 
We obtain the result by leaving some coordinates unassigned. In this 
way we can reference e.g. the whole paradigm of a traditional verb by 
specifying only its dictionary address. We can refer to any form of the 
Present Tense of the given verb by specifying the first three coordinates. 
If we want to refer to any form of the Present Tense of any verb, we 
just have to leave the address coordinate unassigned. For checking 
agreement in an utterance we are interested in an object such as any 
noun (no matter whether a "normal" noun or the Gerund etc.); we
can specify it by assigning the respective value to the second coordi- 
nate. There are many other possibilities, but the examples given above 
should be sufficient to prove that the morphological coordinates are a 
convenient means for handling different vocabulary items. 
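A minimal Python sketch of the eight coordinates, with unassigned slots acting as wildcards, may make the idea concrete. The field names, the numeric codes and the sample lexes are assumptions introduced for the example; the actual internal representation used by MARYSIA is not described here.

```python
from dataclasses import dataclass, fields
from typing import Optional

# Hypothetical codes for morphological types (the real codes are not given).
NOUN, PRESENT_TENSE = 1, 2

@dataclass(frozen=True)
class Coordinates:
    address: Optional[int] = None     # 1: dictionary item address
    morph_type: Optional[int] = None  # 2: morphological type of the formeme
    serial: Optional[int] = None      # 3: serial number of the formeme
    number: Optional[int] = None      # 4
    case: Optional[int] = None        # 5
    gender: Optional[int] = None      # 6
    person: Optional[int] = None      # 7
    allolex: Optional[int] = None     # 8: serial number of the allolex

    def covers(self, lex):
        """True if every assigned (non-None) coordinate equals that of `lex`."""
        return all(
            getattr(self, f.name) in (None, getattr(lex, f.name))
            for f in fields(self)
        )

def select(pattern, lexes):
    return [lex for lex in lexes if pattern.covers(lex)]

# Three fully specified lexes with made-up addresses and category values.
LEXES = [
    Coordinates(17, NOUN, 0, 1, 1, 1, 0, 1),
    Coordinates(17, NOUN, 0, 1, 2, 1, 0, 1),
    Coordinates(42, PRESENT_TENSE, 1, 1, 0, 0, 3, 1),
]

whole_paradigm = select(Coordinates(address=17), LEXES)              # address only
any_noun = select(Coordinates(morph_type=NOUN), LEXES)               # second coordinate only
any_present = select(Coordinates(morph_type=PRESENT_TENSE), LEXES)   # address left unassigned
print(len(whole_paradigm), len(any_noun), len(any_present))
```

Note that a value of zero is an assigned value (the category does not concern the formeme), which is why the sketch uses `None`, not zero, for an unassigned slot.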
3. MARYSIA SYSTEM FROM THE USER'S POINT OF VIEW 
3.1. Script. 
For every application of the MARYSIA system at least one script 
should be prepared. The primary purpose of a script is to establish a 
way of classifying all possible utterances into some kinds of required 
reaction types; this is obtained by listing "decomposition rules" which
should be applied to an utterance for every phase of the man-machine 
dialog. The secondary purpose of the script is to allow generation of 
a computer response by means of "composition rules". At the moment
scripts are coded in a formalism oriented towards their internal
representation in the computer, because a planned preprocessor has 
not yet been implemented. Therefore I will not give any concrete 
example of a script, but I will describe it verbally.
A decomposition rule is a basic item of a script, it describes a class 
of utterances, which are formally similar. It is composed of three parts: 
a list of lex schemata, a list of allowed permutations and a list of required
relations between lexes. A lex schema consists of eight slots
for morphological coordinates; the slots can be filled by coordinate
values or left unassigned. In this way a schema designates a set
of lexes, which can range from exactly one lex to the set of all lexes
described in the system dictionary. There are also some special schemata,
e.g. "short general schema" means any lex from the dictionary
or an empty lex (i.e. no lex at all), and "long general schema" means
a string, possibly empty, of lexes from the dictionary, separated by
non-final punctuation marks (e.g. spaces, commas). There is also a
very important schema called "word schema", which matches every
word not recognized by the morphological analysis of the system. 
Lists of permutations were introduced because of the fairly free word 
order in the Polish language. A permutation is a string of references to 
schemata, described in the first part of a rule, separated by descriptions 
of required punctuation marks (including an empty punctuation mark 
for lexes which are to compose words). For every permutation there is
a "reaction", which is not set by the system, but can be arbitrarily
defined by a user (e.g. it can cause some computation or just point 
to a composition rule for preparing an answer). 
When a decomposition rule is applied to an utterance, the following 
actions are taken. First, the utterance is preprocessed to remove super- 
fluous spaces, change upper case letters to lower case equivalents, etc. 
Then the words are split into lexes when necessary, and lexes are clas- 
sified according to the schemata of the rule. Next, permutations are 
checked sequentially until one of them matches; now is the moment 
when the relational part of the rule becomes important. There are two 
types of relations. One of them is called "agreement" and really serves
for checking agreement of given coordinates (usually number or case)
of instances of lexes described by specified schemata. The second one 
is called "government" and is used to compare a specified coordinate
against a constant or against one of eight "phraseological numbers"
provided for every formeme by the dictionary. The phraseological
numbers describe some syntactic features of a formeme, e.g.
the rection of a verb, the gender of a noun, etc. If the specified relation 
holds, the match is successful and the reaction associated with the per- 
mutation is passed as the result; otherwise the next permutation or the 
next rule is applied. 
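The matching procedure just described can be sketched in Python under strong simplifications. The sketch assumes the utterance has already been preprocessed and analysed into lexes with known coordinates; it ignores punctuation descriptions, the special schemata and phraseological numbers, and it represents a rule as a plain dictionary whose layout is hypothetical and not the script formalism of MARYSIA.

```python
def matches_schema(schema, lex):
    """A schema is a dict of assigned coordinates; missing keys are unassigned."""
    return all(lex.get(name) == value for name, value in schema.items())

def apply_rule(rule, lexes):
    """Return the reaction of the first permutation that matches, else None."""
    for order, reaction in rule["permutations"]:
        if len(order) != len(lexes):
            continue
        if not all(matches_schema(rule["schemata"][s], lex)
                   for s, lex in zip(order, lexes)):
            continue
        bound = dict(zip(order, lexes))  # schema index -> lex instance
        agreement_ok = all(bound[a].get(c) == bound[b].get(c)
                           for a, b, c in rule["agreement"])
        # "government" is checked here against constants only
        government_ok = all(bound[s].get(c) == constant
                            for s, c, constant in rule["government"])
        if agreement_ok and government_ok:
            return reaction
    return None  # the caller would now try the next rule

# Hypothetical rule: an adjective and a noun, in either order, which must agree.
RULE = {
    "schemata": [{"type": "adj"}, {"type": "noun"}],
    "permutations": [((0, 1), "ADJ_NOUN"), ((1, 0), "NOUN_ADJ")],
    "agreement": [(0, 1, "number"), (0, 1, "case"), (0, 1, "gender")],
    "government": [(1, "case", "nom")],
}

nowy = {"lex": "nowy", "type": "adj", "number": "sg", "case": "nom", "gender": "m"}
nowa = {"lex": "nowa", "type": "adj", "number": "sg", "case": "nom", "gender": "f"}
dom = {"lex": "dom", "type": "noun", "number": "sg", "case": "nom", "gender": "m"}

print(apply_rule(RULE, [nowy, dom]), apply_rule(RULE, [dom, nowy]), apply_rule(RULE, [nowa, dom]))
```

Listing both orders as separate permutations is what lets one rule cope with the fairly free word order; the last call fails because the gender coordinates of the two lexes do not agree.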
The structure of composition rules is very similar to decomposition 
rules. The main differences are that for obvious reasons there is only 
one permutation and that the permutation can refer not only to its 
own schemata, but also to instances of schemata of the most recently 
applied decomposition rule. The other difference is that the relations 
are not checked but realized, i.e. the value of a coordinate of one instance
of a lex is assigned to a specified coordinate slot of another schema
("agreement"), or the value of a constant or a phraseological number
is passed to a specified slot ("government"). After this process all
coordinate slots of all schemata should be filled, then the lexes speci- 
fied by coordinates are generated and printed as a computer utterance. 
Scripts contain all rules which are to be applied in a conversation. 
In different moments of a discourse it is necessary to use different sub- 
sets of decomposition rules or to apply a different order for matching 
them. For this purpose decomposition rules can be grouped into "expectation
sets". This is not required for composition rules as they
are pointed to explicitly by the reaction in the matched permutation or by
the "brain" of a user's system.
3.2. Dictionary. 
From the users' point of view the MARYSIA system should have only 
one dictionary; this is not the case at the moment because we have 
not yet implemented some necessary utility programs and a user who 
wants to update the MARYSIA's vocabulary is involved with three dic- 
tionaries. It is only a temporary situation and therefore I will describe 
now exclusively the main dictionary, which is to be the "only" one.
The main dictionary contains items, which are composed of three
divisions: morphological, syntactic and pragmatic. The last one is not 
used in practice. The syntactic division is rather primitive, it consti- 
tutes just a set of eight phraseological numbers per formeme. The 
morphological division is of most interest to us and it is the most com- 
plicated. First, it is split into four parts, according to the four grades 
of frequency evocation. The reason for this is the following. In some 
applications it may be necessary to reduce the adequacy of the diction- 
ary because of constraints on dictionary size or because it just will not 
be needed; then we are able to remove easily the parts of items which 
are of no interest. Next, every part is a list of morphological segments 
(in most cases it contains only one segment). Every segment has its 
quality evocation, which is stored as a string constituting a qualifying 
label from W. DOROSZEWSKI's dictionary (1958), and is often empty.
As in the case of frequency evocation, we can easily get rid of segments
with non-empty labels if we do not need them. Segments are of three
types, which serve different purposes. The simplest one is a quotational 
segment, which contains a list with explicitly coded lexes, together 
with their morphological coordinates and their quality evocations. 
This segment is used separately only for uninflected items; it serves 
more often as a supplement to other types of segments and thus contains
the "variant" or "exception" forms of a paradigm. The second
type of segment is a generation segment, which is used to describe
some irregular items by means of an algorithm for generating their 
lexes. The third and the most important one is a parametric type of 
segment. 
A parametric segment contains three ordered sets of parameters, 
which are called morphological evocations, morphological numbers 
and morphological bases. The latter two can be considered as
generalizations of the traditional concepts of, respectively, a pattern of inflexion
and a stem. The difference is that a base can be constituted by
an arbitrarily defined string of letters, and the number of bases for 
describing the given type of item can also be arbitrarily defined. Simi- 
larly, inflexional patterns traditionally classify whole paradigms, but we 
can arbitrarily split the paradigm into some subparadigms (which may 
consist even of single forms) and then we can independently assign a 
description to every subparadigm. The morphological evocation has 
no counterpart: it decides whether a slot in a paradigm is filled by a 
given item or not. 
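How the three kinds of parameters might cooperate can be shown by a small Python sketch. Everything concrete in it is an assumption made for illustration: the pattern table, its numbering, the slot names and the restriction to one subparadigm (six singular cases) are hypothetical and do not reproduce MARYSIA's tables.

```python
# A morphological number selects a pattern; each slot of the pattern names
# the base to use (by position in the list of bases) and an ending.
PATTERNS = {
    1: {
        "nom": (0, ""), "gen": (1, "a"), "dat": (1, "u"),
        "acc": (1, "a"), "ins": (1, "em"), "loc": (1, "ie"),
    },
}

def generate(bases, number, evocation=None):
    """Generate the lexes of one subparadigm.

    bases     -- arbitrarily chosen strings of letters (not necessarily stems)
    number    -- morphological number, i.e. a key into PATTERNS
    evocation -- slot -> bool; a slot marked False is not filled by this item
    """
    evocation = evocation or {}
    return {
        slot: bases[base_index] + ending
        for slot, (base_index, ending) in PATTERNS[number].items()
        if evocation.get(slot, True)
    }

# Polish "pies" (dog) needs two bases because of the alternation pies / ps-.
print(generate(["pies", "ps"], 1))
# The evocation below is artificial; it only shows a slot being left empty.
print(generate(["pies", "ps"], 1, {"loc": False}))
```

Storing two arbitrary bases instead of one stem is what lets a single simple pattern cover an alternating paradigm, at the cost of dictionary space.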
There are different levels of parametric segments. If a segment 
describes a formeme, then it belongs to the inflexional level. In the 
first stage of dictionary development all items can belong to this level, 
but if we want to make full use of script possibilities, we have to provide 
also the derivational level. The segments of this level describe groups, 
i.e. sets of formemes. This is done by simulating the inflexional segments
which would have to be put in the dictionary if the derivational level
were absent. We can introduce also a third level, the extractional one.
An extractional segment provides parameters for a special type of 
group; by taking into account idiosyncrasies of a given type of group, 
the extractional segment can use fewer parameters (especially bases, which
are very space consuming) than an equivalent derivational segment. 
The multilevel lex generation allows us to trade off between the size 
of a dictionary and the time of lex generation. Together with possi- 
bilities of other trade-offs, e.g. adequacy versus dictionary capacity, it 
should make the dictionary system easy to adapt to different applica- 
tions. 
3.3. System tables. 
Developing a good algorithm of inflexional analysis and synthesis 
is not an easy task. Instead of trying to obtain it in the first attempt, 
we decided to design our system as a set of table-driven programs. 
Therefore we may improve the system performance by exchanging 
step by step its tables; we can also easily change our previous decisions
concerning, for example, borders of formemes, etc. It even seems possible
to convert the MARYSIA system into a version for another language;
the results of the first attempt to do this (L. KWIECIŃSKI, 1972) are encouraging.
Now we will review the system tables in the order of their
application for system response. 
The input utterance is first preprocessed and coded in the special PF
code; these are the only non-table-driven parts of the system. Then
the words are divided into lexes. This is the task of two finite automata
(all system automata are, of course, driven by exchangeable tables),
which scan a word in both directions and establish probable lex borders.
Now the lexes are to be transformed into keys for searching in
a backing dictionary called the index. The transformation consists of
cutting some letters from the ends of the lexes; the place for the cut
is indicated by another set of automata. Every automaton of the set
works on the assumption that the lex belongs to a given formeme
type; the key (or keys) suggested by the automaton is then searched for
(by means of hash coding) in the segment of the index which is devoted
mainly to keys of the given formeme type. This has some advantages.
First, it is a way of solving some cases of homonymy; next, the automata
are small and therefore easy to design and to debug. The keys are not
matched exactly, but owing to the PF code (J. ST. BIEŃ, 1971) and the special
formats of the index entries, stem alternation is not taken into account
during the matching. It should be noted that at this moment
some false hypotheses concerning lex borders are rejected because the
respective keys are not found in the index. The index contains pointers
to the linkage dictionary, which was designed as a separate part because 
of storage constraints. The linkage dictionary yields for every lex its 
first three morphological coordinates, i.e. a formeme specification in- 
cluding formeme type. We have noted that the latter information 
together with the lex itself is usually enough to establish the rest of the 
morphological coordinates with high probability, therefore now the 
lex is inspected by one of the special automata, which outputs possible 
coordinates. If we do not require 100 percent probability that the coor- 
dinates are correct, we can stop the analysis at this moment; otherwise 
we reconstruct the lexes by synthesis and reject false hypotheses. 
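
The analysis steps described above can be sketched as follows. This is only a toy model under stated assumptions: the cut positions, the index contents, the endings and all function names are hypothetical, keys are matched exactly (the real system tolerates stem alternation through the PF code), and the per-type automata are reduced to fixed cut lengths.

```python
# Hypothetical data; the real index, linkage dictionary and automata differ.

# Number of final letters that each formeme type's "automaton" proposes to cut.
CUTS = {"noun": [1, 0], "verb": [2]}
# Index: one hash table per formeme type, mapping keys to linkage pointers.
INDEX = {"noun": {"dłut": 0}, "verb": {"pis": 1}}
# Linkage dictionary: formeme type for each pointer (other coordinates omitted).
LINKAGE = ["noun", "verb"]
# Toy endings used by the synthesis step.
ENDINGS = {"noun": ["o", "a", "u"], "verb": ["ać", "ze"]}


def candidate_keys(lex, formeme_type):
    """Propose keys by cutting letters from the end of the lex."""
    for cut in CUTS[formeme_type]:
        if cut < len(lex):
            yield lex[:len(lex) - cut]


def synthesise(key, formeme_type):
    """Reconstruct every lex that the key can produce for this type."""
    return {key + ending for ending in ENDINGS[formeme_type]}


def analyse(lex, verify=True):
    """Return (key, formeme type) hypotheses for a lex.

    With verify=False the analysis stops after the index lookup; with
    verify=True each hypothesis must also reproduce the lex by synthesis.
    """
    hypotheses = []
    for assumed_type, segment in INDEX.items():
        for key in candidate_keys(lex, assumed_type):
            pointer = segment.get(key)
            if pointer is None:
                continue  # key absent from the index: hypothesis rejected
            formeme_type = LINKAGE[pointer]
            if verify and lex not in synthesise(key, formeme_type):
                continue  # synthesis does not give the lex back
            hypotheses.append((key, formeme_type))
    return hypotheses


print(analyse("dłuta"))
print(analyse("dłuty", verify=False))
print(analyse("dłuty"))
```

The non-word "dłuty" survives the index lookup because its key is present, and is rejected only when verification by synthesis is switched on; this is the speed-versus-certainty choice mentioned in the text.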
The tables for the synthesis are more differentiated. First, there is
a table of formatives, i.e. strings of letters used to compose lexes. Next,
there is a table of choice functions. Choice functions are finite automata
(they can also compose a choice function segment in a dictionary
item). Then there is a large table called the inflexional partition. Besides
some technical information it contains algorithms transforming the parameters
of an inflexional segment into lexes; the algorithms are expressed
by means of an extremely primitive "morphological description language"
consisting of about ten instructions. The other two partitions
are optional. They contain algorithms in the morphological descrip- 
tion language to transform parameters of one level into parameters 
of another level, i.e. extractional ones into derivational ones or deriva- 
tional parameters into inflexional ones. 
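
The text does not list the instructions of the morphological description language, so the following interpreter is purely a sketch: the four instruction names, the formatives table and the parameter layout are invented for illustration. It only shows how a very small instruction set, driven by a formatives table and by the bases stored in a dictionary segment, can compose lexes.

```python
# Assumed contents of a formatives table: strings used to compose lexes.
FORMATIVES = ["o", "a", "u", "ie"]


def run_program(program, params):
    """Interpret a tiny, invented morphological description language.

    Instructions (all hypothetical):
      ("BASE", i)       start a new lex from base number i in params
      ("CUT", n)        drop the last n letters of the current lex
      ("FORMATIVE", k)  append formative number k
      ("EMIT",)         output the current lex
    """
    lexes, current = [], ""
    for instruction in program:
        op, *args = instruction
        if op == "BASE":
            current = params["bases"][args[0]]
        elif op == "CUT":
            current = current[:len(current) - args[0]]
        elif op == "FORMATIVE":
            current += FORMATIVES[args[0]]
        elif op == "EMIT":
            lexes.append(current)
        else:
            raise ValueError(f"unknown instruction: {op}")
    return lexes


# A program for a toy paradigm with two bases (stem alternation t : c).
PROGRAM = [
    ("BASE", 0), ("FORMATIVE", 0), ("EMIT",),
    ("BASE", 0), ("FORMATIVE", 1), ("EMIT",),
    ("BASE", 1), ("FORMATIVE", 3), ("EMIT",),
]
print(run_program(PROGRAM, {"bases": ["dłut", "dłuc"]}))
```

Because the program is data, the same interpreter serves every paradigm; a segment only has to supply its own bases.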
For all types of information needed by the MARYSIA system there is
a computer-independent (although at the moment rather awkward)
external form. Its syntax is given in the BNF notation and its semantics
is described in Polish in BIEŃ et al. (1973).
4. Present state of the project and the future development. 
Because of the delay in installing a new computer for Warsaw
University, we have decided to implement the system in the first instance
on the GIER computer, the only available one when the project
was started. It was decided to write the programs in GIER ALGOL 4 and
to split the analysis and synthesis parts of the system into passes because
of fast storage constraints. At the moment all parts of the system have
been implemented, and the tables of the system have been debugged and
thoroughly tested; small dictionaries for testing purposes have been prepared.
Still before us is checking the system as a whole, working according
to some testing scripts.
In the future we want to rewrite the MARYSIA system for a bigger 
and faster computer (it will probably be the IBM 360) and to develop 
some utility programs to facilitate loading the backing dictionaries and 
script writing. We will also check the generality of the system tables 
by preparing a German language version of the MARYSIA system. 
As far as the long-term plans are concerned, the following tasks 
are to be solved. First, it will be necessary to improve the adequacy 
of the MARYSIA morphological component by increasing the number
of entries in the dictionaries. Secondly, it will be necessary to develop 
systems which will cover the higher levels of the language; because 
of our "bottom-up " approach to language description it will be the 
syntax that will be elaborated next. The third direction of the research 
can be called developing text-world interfaces; I mean by this accepting 
texts prepared for typesetting devices, optical character recognition, 
and voice input and output. For technical reasons, OCR will probably
be excluded; speech processing by computer is the interest of another
group at Warsaw University and we hope to join together at a suitable
moment, which should not be before developing at least a good
syntactic parser (following the recent ideas of e.g. D. R. HILL, 1972).
Therefore in the near future we will be interested only in input of 
text coded on different kinds of media used in the printing industry. 
SAMPLE DICTIONARY ENTRIES
-S_LO_NCU-
-S_LO_NCU- 
-S_LO_NCA- 
-S_LO_NC- 
0 ]
0,0,0
1
[1
4,1,0,5,1,1,0,0 ]
0 >
1, <
1,2
0
1,3 [
2,7,6 
2,2 
1,5,1,3,1,5,4 
-P_LUCO- 
-P_LUCA- 
-P_LUCU- 
-P_LUCU- 
-P_LUCA- 
-P_LUC- 
0 ]
0,0,0
1
[1
4,1,0,5,1,1,0,0 ]
0 >
1, <
1,2
0
1,3 [
2,7,6 
2,2 
1,6,1,3,1,5,4 
-_LYKO- 
-_LYKA- 
-_LYKU- 
-_LYKU- 
-_LYKA- 
-_LYK- 
0 ]
0,0,0
1
[1
4,1,0,5,1,2,0,0 ]
0 >
1, <
1,2
0
1,3 [
2,7,6 
2,2 
1,5,1,3,1,5,4 
-D_LUTO- 
-D_LUTA- 
-D_LUTU- 
-D_LUCIE- 
-D_LUTA- 
-D_LUT- 
0 ]
0,0,0
1
[1
4,1,0,5,1,2,0,0 ]
0 >
2, <
1,2
0
1,3 [
2,7,6
2,2 
1,5,1,6,1,8,7 
-PISKL_E-
SAMPLE PARADIGM LISTING, USED FOR CHECKING THE DICTIONARY 
[BIEN] CKL2
ITEM 1596
LEVEL 1 TYPE 6
FORMEME TYPE 6
FORM CATEGORIES 0 1 0 0
PROPER LEXEME
LEX JA
LABEL
FORM CATEGORIES 0 2 0 0
PROPER LEXEME
CHOICE FUNCTION ADDRESS 73
ALLOLEX NUMBER 1
LEX MNIE
LABEL
ALLOLEX NUMBER 2
LEX MNIE
LABEL
ALLOLEX NUMBER 3
LEX MI_E
LABEL
FORM CATEGORIES 0 3 0 0
PROPER LEXEME
CHOICE FUNCTION ADDRESS 73
ALLOLEX NUMBER 1
LEX MNIE
LABEL
ALLOLEX NUMBER 2
LEX MNIE
LABEL
ALLOLEX NUMBER 3
LEX MI
LABEL
FORM CATEGORIES 0 4 0 0
PROPER LEXEME
CHOICE FUNCTION ADDRESS 73
ALLOLEX NUMBER 1
LEX MNIE
LABEL
ALLOLEX NUMBER 2
LEX MNIE
LABEL
ALLOLEX NUMBER 3
LEX MI_E
LABEL
FORM CATEGORIES 0 5 0 0
PROPER LEXEME
LEX MN_A
LABEL
FORM CATEGORIES 0 6 0 0
PROPER LEXEME
LEX MNIE
LABEL
FORM CATEGORIES 0 7 0 0
NON-EXISTENT
ITEM 1615
LEVEL 1 TYPE 6
FORMEME TYPE 6
FORM CATEGORIES 0 1 0 0
PROPER LEXEME
LEX TY
LABEL
FORM CATEGORIES 0 2 0 0
PROPER LEXEME
CHOICE FUNCTION ADDRESS 73
ALLOLEX NUMBER 1
LEX CIEBIE
LABEL
ALLOLEX NUMBER 2
LEX CIEBIE
LABEL
ALLOLEX NUMBER 3
LEX CI_E
LABEL
FORM CATEGORIES 0 3 0 0
PROPER LEXEME
CHOICE FUNCTION ADDRESS 73
ALLOLEX NUMBER 1
LEX TOBIE
LABEL
ALLOLEX NUMBER 2
LEX TOBIE
LABEL
ALLOLEX NUMBER 3
LEX CI
LABEL
FORM CATEGORIES 0 4 0 0
PROPER LEXEME
CHOICE FUNCTION ADDRESS 73
ALLOLEX NUMBER 1
LEX CIEBIE
LABEL
ALLOLEX NUMBER 2
LEX CIEBIE
LABEL
ALLOLEX NUMBER 3
LEX CI_E
LABEL

References

J. ST. BIEŃ, Prowizoryczna terminologia czasownikowa (unpublished paper), 1970.

J. ST. BIEŃ, An Alphabetic Code for the Inflexional Analysis of Polish Texts, in «Algorytmy», VIII (1971), 14.

J. ST. BIEŃ, O pewnych problemach przetwarzania języków fleksyjnych na maszynach cyfrowych, in «Prace Filologiczne», XXIII (1972 a).

J. ST. BIEŃ, O dwóch pojęciach pożytecznych przy automatycznym przetwarzaniu tekstów, in Z polskich studiów slawistycznych, Seria 4, Językoznawstwo, Warszawa, 1972 b.

J. ST. BIEŃ, W. ŁUKASZEWICZ, S. SZPAKOWICZ, Opis systemu MARYSIA, in «Sprawozdania IMM i ZON UW» (Reports of the Warsaw University Computational Centre), n. 41, 42, 43 (1973).

W. DOROSZEWSKI (ed.), Słownik języka polskiego, 11 voll., Warszawa 1958-1969.

R. W. FLOYD, Towards Interactive Design of Correct Programs, in IFIP Congress 1971, Invited Papers, 1971.

D. R. HILL, An Abbreviated Guide to Planning for Speech Interaction with Machines: the State of the Art, in «International Journal of Man-Machine Studies», IV (1972) 4.

L. KWIECIŃSKI, Die deutschsprachige Variante des Konversationssystems MARYSIA (M.A. thesis), Warsaw 1972.

C. MEADOW, Man-Machine Communication, New York 1970.

A. PENTTILÄ, The Word, in «Linguistics», LXXXVIII (1972), pp. 32-37.

J. WEIZENBAUM, ELIZA: a Computer Program for the Study of Natural Language Communication between Man and Machine, in «Communications of the ACM», IX (1966) 1.

T. WINOGRAD, Procedures as a Representation for Data in a Computer Program for Understanding Natural Language, Cambridge (Mass.) 1971.
