TOWARDS A NEW TYPE OF MORPHEMIC ANALYSIS
Eva Koktová
9. května 1576
39001 Tábor, Czechoslovakia
ABSTRACT
The present paper provides a report on a new system of an automated
morphemic analysis of technical texts in Czech as a highly inflectional
language, which is being prepared by the linguistic team of the Faculty
of Mathematics and Physics in Prague, within the project of man-machine
communication without a pre-arranged data base (TIBAQ). The kind of
morphemic analysis presented here is based on a retrograde
(right-to-left) analysis of words by means of morphemically unambiguous
or irresolvably ambiguous word-ends, which do not coincide with the
etymological word-endings but correspond to the structure of the
accidental cases of morphemic ambiguity in an inflectional language
(word-endings being accountable for in a certain way by word-ends). The
algorithm of analysis can thus dispense with any dictionary (of
morphemic irregularities and exceptions), economically accounting
especially for productive word-endings. The word-ends of the analysis
are assigned several kinds of morphemic information, concerning
morphemic categories and lemmatization.
The analysis is based on the absolute frequency of word-ends in
technical texts and is able to interact with the semantic analysis.
1. INTRODUCTION
The present paper provides a report on a new system of an automated
morphemic analysis of technical texts in Czech, which is being prepared
by the linguistic team of the Faculty of Mathematics and Physics in
Prague. The morphemic analysis of Czech, which is a highly inflectional
language, constitutes the starting point for any kind of automated
processing of language, ranging from automatic information retrieval to
natural language understanding.
There is a previous project of morphemic analysis of Czech described in
(Weisheitelová, Králíková and Sgall, 1982), which is based on an
analysis of etymological word-stems and word-endings (suffixes). The
present system, on the other hand, is based on a retrograde
(right-to-left) analysis of words, which makes it possible to dispense
both with the dictionary of stems and the dictionary of endings; it was
partly inspired by the system MOSAIC (Kirschner, 1982) (intended first
of all for automatic indexing of technical texts), which is also based
on a kind of retrograde analysis: namely, on singling out the four
rightmost symbols of the word-forms of autosemantic words, which are
then matched against a list of word-endings. This kind of analysis,
however, cannot avoid the danger of ambiguity, which is prevented by
a number of ad-hoc restrictions, for example reducing the universe of
discourse.
The present system of morphemic analysis differs from the previous ones
in several essential respects:
(i) The algorithm of the present type of morphemic analysis can be
viewed as a structured list of morphemically unambiguous or
irresolvably ambiguous word-ends of Czech words (which may be
accidentally identical with full word-forms) including information
concerning their morphemic categories and lemmatization. We believe
that this principle can be considered as adequate for the morphemic
analysis of any inflectional language.
(ii) In the present system, it is also easier to carry out
lemmatization: there are only several tens of simple and highly general
lemmatization rules appended to the morphemic information accompanying
every word-end in the algorithm.
(iii) In the present system, the burden of the analysis lies entirely
on the algorithm. There is no need of any dictionary in which
etymological irregularities would be listed.
(iv) The algorithm is based on the absolute frequency of word-ends in
technical texts. It consists of two parts; the first of them involves
about two hundred word-ends by means of which it is possible to resolve
about fifty percent of a technical text.
(v) By means of the algorithm it is possible to analyze an unlimited
number of new (newly coined) words with productive etymological
word-endings. Thus, both the user and the linguist are relieved of the
work which must be usually done when a new lexical item is being
incorporated into a system of morphemic analysis of an inflectional
language.
(vi) The algorithm is going to be implemented in PL/1 within a system
of natural language understanding, namely the project of man-machine
communication called TIBAQ (Text-and-Inference Based Answering of
Questions, cf. (Hajičová and Sgall, 1981)) with no pre-arranged data
base and with the capacity of self-enriching by information drawn from
the text; the project is based on the linguistic theory of the
Functional Generative Description.
(vii) Underlying the algorithm is a large amount of empirical work; it
analyzes several tens of thousands of (autosemantic and synsemantic)
words (drawn from a retrograde dictionary of Czech, cf. (Slavíčková,
1975)), including the word-forms of inflected words. The choice of the
autosemantic lexical units to be analyzed was carried out with respect
to technical texts concerning microelectronics.
2. THE PHILOSOPHY OF THE SYSTEM
The major novelty of the present approach consists in the conception of
(morphemically unambiguous or irresolvably ambiguous) word-ends, which
do not correspond to the (etymological) word-inflection and
word-formation endings but to the cases of accidental morphemic
ambiguity in an inflectional language, every word-ending being
accountable for by at least one word-end (piece of output information).
On the other hand, every word-end corresponds to (stands for) at least
one lexical word, and due to the cases of morphemic ambiguity, it
represents at least one word-form. A word-end is usually equivalent to
a part of a word-form, but accidentally it may be equivalent to a full
word-form.
The algorithm of analysis, embodying a conception of procedural
morphemics, can be viewed as a structured list of word-ends arranged in
a branching structure consisting of yes-no answers to queries, with
corresponding sequences (strings) of symbols of increasing length,
which is due to the retrograde adding of symbols (we use 40 letters of
the Czech alphabet, including the ones with diacritics), until
morphemically unambiguous or irresolvably ambiguous word-ends are found
(morphemic ambiguity counting as a valid result of the analysis, since
it can be resolved, in most cases, by means of the syntactic analysis).
The word-ends are assigned the kinds of information as described in
section 3.
In the present system of morphemic ana- 
lysis, there is no place for the notion of 
(etymological) irregularity, all word-ends 
being equally "regular"; the differences 
between them can be accounted for e.g. in 
terms of their length or of their positi- 
ons on the scale of absolute frequency 
(cf. section 5). It may even be the case 
that an etymologically highly irregular 
word-form can be analyzed by a relatively 
small number of symbols (of its word-end), 
and the other way round. 
In the horizontal progress of the algorithm (which corresponds to the
answer yes - a new symbol is added) the output information concerns
a single word-end, while in the vertical progress (corresponding to the
answer no - different symbols than the one(s) in question are added) it
usually concerns more than one word-end. These word-ends can be
labelled as complementary word-ends with respect to the horizontal
word-end(s) in question; they consist of the same sequence of symbols
as the correlated horizontal word-ends with the exception of their
respective leftmost symbols, which belong to the complementary set of
symbols of the alphabet with respect to the leftmost symbol(s) of the
horizontal word-end(s), according to the combinatorics of letters in
existing Czech words (for example, the complementary word-ends to the
horizontal word-ends /m~r, dm~r, #m~r are only four: ~m~r, ~__~__j.r,
omer, ~ (the symbol / stands for the end of the word, i.e. indicates
a word-end in the form of a full word-form)). Throughout the algorithm,
the notation concerning the complementary word-ends is abbreviated in
that in their place only their common output information is written
(cf. the three occurrences of A in Figure 1 below).
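The notion of complementary word-ends can be made concrete with a small sketch. It assumes that the horizontal word-ends under consideration share everything but their leftmost symbol, and it derives the complementary word-ends from a word list; the function name and the toy lexicon are hypothetical and serve only as an illustration, not as a part of the algorithm itself.

```python
BOUNDARY = "/"  # the paper's symbol for the end of the word (a full word-form)


def complementary_word_ends(horizontal, lexicon):
    """Return the word-ends that share the tail of the horizontal word-ends
    but differ in the leftmost symbol, as attested in the given lexicon."""
    tails = {word_end[1:] for word_end in horizontal}
    if len(tails) != 1:
        raise ValueError("horizontal word-ends must differ only in the leftmost symbol")
    tail = tails.pop()
    taken = {word_end[0] for word_end in horizontal}
    found = set()
    for word in lexicon:
        marked = BOUNDARY + word
        if marked.endswith(tail) and len(marked) > len(tail):
            left = marked[len(marked) - len(tail) - 1]
            if left not in taken:
                found.add(left + tail)
    return sorted(found)


# Toy lexicon (an assumption made for this example only).
toy_lexicon = ["dobrý", "kyprý", "starý", "prý", "velký"]
print(complementary_word_ends(["prý"], toy_lexicon))
```

Only symbol combinations which actually occur in the word list are returned, which mirrors the remark that the complementary set is restricted by the combinatorics of letters in existing Czech words.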
The conception just discussed can be illustrated by a chunk of the
algorithm accounting for the frequent word-inflection ending ý (which
is an adjectival word-ending, ambiguous among nominative and accusative
singular masculine-inanimate, and nominative singular
masculine-animate, thus representing the adjectival "normal form"),
which clashes only with /prý (adverb), being accounted for by the three
occurrences of the output information A (standing for the morphemic
information in question) in Figure 1.
Figure 1. A chunk of the algorithm.
-- rý -- prý -- /prý -- B
   |     |      |
   A     A      A
The three occurrences of A in Figure 1 can be indicated, for the sake
of clarity, as A1, A2 and A3: A1 (corresponding to the horizontal
string rý) accounting for those Czech adjectives (in the given form)
whose penultimate symbol is different from r (such as velký (big)), A2
(corresponding to the horizontal string prý) accounting for those Czech
adjectives (in the given form) whose second symbol from the right is
r and whose third symbol from the right is different from p (such as
dobrý (good)), and A3 (corresponding to the horizontal word-end /prý)
accounting for those Czech adjectives (in the given form) whose third
and second symbols from the right are p and r, respectively, and whose
fourth symbol from the right is different from /, i.e. which are longer
than three symbols (in Czech, there is only one such adjective, namely
kyprý (loose, plump)). On the whole, A1, A2 and A3 account for all
Czech adjectives (in the given form).
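The chunk of Figure 1 can be sketched as a right-to-left walk through a small trie. This is only an illustration under stated assumptions: the nested-dictionary representation and the names build_trie and analyze are hypothetical, and the complementary word-ends are modelled by letting a walk that cannot continue fall back on the last output information it has passed.

```python
BOUNDARY = "/"  # the paper's symbol marking a word-end that is a full word-form


def build_trie(word_ends):
    """Store every word-end right-to-left in a nested-dictionary trie."""
    root = {"next": {}, "info": None}
    for word_end, info in word_ends.items():
        node = root
        for symbol in reversed(word_end):
            node = node["next"].setdefault(symbol, {"next": {}, "info": None})
        node["info"] = info
    return root


def analyze(word, trie):
    """Add symbols from the right; return the last output information seen."""
    node, output = trie, None
    for symbol in reversed(BOUNDARY + word):
        node = node["next"].get(symbol)
        if node is None:  # answer "no": the complementary output applies
            break
        if node["info"] is not None:
            output = node["info"]
    return output


# The chunk for the ending ý: A (adjective) everywhere except the full
# word-form prý, which yields B (adverb).
CHUNK = build_trie({"ý": "A", "/prý": "B"})

for form in ["velký", "dobrý", "kyprý", "prý"]:
    print(form, analyze(form, CHUNK))
```

In this sketch velký, dobrý and kyprý receive A after one, two and three matched symbols, respectively, and only prý reaches the boundary symbol and receives B.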
3. KINDS OF INFORMATION
The word-ends (i.e. the horizontal 
word-ends and the complementary word-ends 
with respect to the given horizontal 
word-ends) are assigned the following 
kinds of information. 
A. Morphemic information.
(i) The information concerning part-of-speech categories includes the
distinction between Nouns, Verbs (these kinds of information are
further subcategorized), Adjectives (A), Adverbs (B), Prepositions (C),
Conjunctions (D) and Pronouns (Zj) (there are distinguished three kinds
of pronouns, namely those which function as nouns, those which function
as adjectives, and those which function both ways).
(ii) The information concerning gram- 
matical categories includes the following 
distinctions (with respect to the part- 
-of-speech categories). 
(a) Declension. 
(aa) Case (six cases, indicated as 1, 2, 3, 4, 6 and 7) is
distinguished not only with nouns, but due to grammatical agreement,
also with adjectives and pronouns.
(bb) Number (singular and plural, indi- 
cated as sg and pl, respectively) is 
distinguished with nouns, and due to 
grammatical agreement, also with adjecti- 
ves, pronouns and verbs. 
(cc) Gender (combined with animateness) is distinguished with nouns,
and due to grammatical agreement, partly also with adjectives, pronouns
and verbs (with verbs, for example, in the past and passive participles
plural). With nouns, four genders are distinguished:
masculine-inanimate (N), masculine-animate (Ž), feminine (F), and
neuter (S). The category of animateness is involved rather with
masculine than with feminine and neuter nouns because with plural
masculine nouns the difference in animateness is present, due to
grammatical agreement, also with verbs and adjectives in the above
mentioned way, and because in technical texts substantially more
masculine-animate than feminine-animate nouns are found.
(b) Conjugation. 
With verbs, there are distinguished
person (three persons, with the exception 
stated in section 4), number (cf. (bb) 
above), tense (present, past and future), 
mood (indicative and imperative), and 
voice (active and passive). As concerns 
notation, usually several kinds of infor- 
mation are collapsed in a single abbrevi- 
ation, cf. K standing for the third per- 
son singular active indicative present. 
There is no need of information concerning the inflectional types of
nouns, adjectives and verbs; for example the word-ends corresponding to
the class of nouns represented by the word-forms katodami (by cathodes)
and vlastnostmi (by properties) (both 7 pl) are assigned the same
morphemic information, though the word-forms in question belong to
etymologically quite different types of inflection of (feminine) nouns
(cf. the difference between the word-inflection endings, ami and mi,
respectively).
B. Lemmatization information.
Lemmatization, i.e. converting an inflected word-form into the normal
form
(i.e. 1 sg with nouns, 1 sg masculine 
with adjectives and pronouns, and the 
infinitive form with verbs) has a speci- 
fic purpose, being connected with those 
applications of morphemic analysis which 
concern the terminological elements of 
technical texts (such as automatic inde- 
xing). 
In the present system, lemmatization 
is carried out by a retrograde erasing of 
a certain number of symbols (possibly 
zero) and by adding a number of specific 
symbols (possibly zero) to what has been 
left after the erasing; in lemmatization 
(unlike in the rest of the algorithm) we 
work with diacritic marks as specific 
symbols. In this way, lemmatization can 
be accounted for by means of several 
tens of simple and highly general rules, 
cutting across the inflectional endings 
and also across the inflectional types 
of different part-of-speech categories. 
It should be pointed out that lemmatization concerns rather the
concrete words (word-forms) found in a text than the word-ends
themselves: though the majority of the lemmatization rules operate on
word-ends (concerning usually only a part of a word-end, which is close
to a word-ending, cf. the symbol y in the word-end tody, corresponding
to the word-form katody), in exceptional cases, for example where the
stem of a word is affected by an alternation, the erasing may reach to
the left of the concrete word, i.e. behind the word-end; cf. the
word-end ste (consisting of three symbols), which, with some
simplifications, unambiguously indicates a verb (K), but which is not
sufficient for the lemmatization of such verb-forms as roste (grows) to
their infinitives (růst (to grow)), where four rightmost symbols of the
concrete word should be considered.
The rules of lemmatization have generally the form [X; abc...], where
X stands for the number of the symbols to be erased, and abc..., for
the specific symbols to be added. In the algorithm, the rules are
usually referred to by numbers, and listed in an appendix. Thus, for
example, Rule 2 ([1; a]) converts katody (cathodes; F 2 sg ~ 1, 4 pl)
into katoda (cathode; F 1 sg) by erasing one symbol (namely y) and by
adding one symbol (namely a). (~ stands for the relation of
ambiguity.)
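A rule of the form [X; abc...] can be sketched as a short function. The sketch below rests on one assumption which is not stated in the paper: the treatment of diacritic marks as specific symbols is modelled here by Unicode canonical decomposition (NFD), so that, for instance, ě counts as the two symbols e and háček. The function name apply_rule is hypothetical.

```python
import unicodedata


def apply_rule(word_form, erase, add=""):
    """Apply a lemmatization rule [erase; add] to a word-form.

    Diacritic marks are counted as separate symbols (assumption: NFD
    decomposition stands in for the paper's symbol inventory).
    """
    symbols = unicodedata.normalize("NFD", word_form)
    if erase > len(symbols):
        raise ValueError("cannot erase more symbols than the word-form has")
    kept = symbols[: len(symbols) - erase]
    return unicodedata.normalize("NFC", kept + add)


RULE_2 = (1, "a")  # [1; a]: erase one symbol, add a
print(apply_rule("katody", *RULE_2))
```

Under the same assumption, a rule which erases two symbols and adds nothing turns obvodě into obvod, because the final ě decomposes into two symbols.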
Every lemmatization rule has at least one application to various types
of morphemic categories concerning not only different distinctions
within a single part-of-speech category (typically, different genders
with nouns) but also different part-of-speech categories (for example,
a single lemmatization rule can be applied to nouns, adjectives, and
verbs): this means that a lemmatization rule may concern, in any of the
part-of-speech categories in question, more than one word-ending (e.g.
of different gender), and these word-endings may be in turn ambiguous
between various case-and-number values. This can be illustrated by
Rule 6 and Rule 8. Rule 6 ([1; ∅] - erase one symbol, add nothing) cuts
across nouns, adjectives, and verbs, converting e.g. spoje
(communications) to spoj (communication), mladým (by young, to young
...) to mladý (young), and a verb form to its normal form. Rule 8
([2; ∅] - erase two symbols, add nothing) has five applications (to all
genders of nouns and to adjectives) and corresponds, on the whole, to
16 word-endings, out of which two are two-ways ambiguous as concerns
case and number. The 16 word-endings are illustrated by the word-forms
in Figure 2 (where obvod = circuit, odborník = expert, katoda =
cathode, vlastnost = property, relace = relation, stavení = building,
mladý = young, and původní = original).
Figure 2. Lemmatization.
N: obvodě (6 sg); obvodem (7 sg); obvodů (2 pl)
Ž: odborníkem (7 sg); odborníků (2 pl)
F: katodám, vlastnostem (3 pl); katodami, vlastnostmi, relacemi (7 pl)
S: staveních (6 pl); staveními (7 pl)
A: mladých, původních (2 ~ 6 pl); mladými, původními (7 pl)
In the above survey, the words which are assigned common information
(e.g. katodami, vlastnostmi, relacemi) belong to etymologically
different types of inflection, which, however, need not be
distinguished here: though the lemmatization rules can be arranged in
a scale according to their complexity or range of application, the
present method of lemmatization covers both simple (regular) and
complicated (irregular) types of word-inflection and word-formation in
an equally economic manner.
C. Semantic information.
The semantic analysis by means of the retrograde morphemic analysis is
an as yet unfinished, but presumably smoothly feasible task, which will
be based on the account of productive word-endings by means of
word-ends.
The considerations concerning the semantic analysis should start from
establishing a set of semantic categories (classes) of nouns and
possibly also adjectives which are considered to be relevant for the
analysis of technical texts. In addition to the consideration of
productive word-endings, there can be also introduced into the
algorithm such word-ends which account for semantically relevant but
only restrictedly productive word-formation endings (such as metr
(meter)), if such word-ends have been "hidden" in the complementary
word-ends of the algorithm (for example, it may happen that
a productive word-ending coinciding with a single word-end (such as
tko, cf. below) is "hidden" in this way).
In establishing the set of semantic categories, we can draw from
(Buráňová, 1980) and (Kirschner, 1983), proposing that there should be
introduced for example the category of Instrument (Tool) (as expressed
by the productive word-endings dlo, tko, ač, ič, čka, 4r, n~ and by the
restrictedly productive word-endings metr, ..., fon and skop), ...
(ení, ... and ...), Property (ost, ita ...), etc.
The information concerning semantic analysis can be rendered by
indicating certain pieces of output information as semantically
relevant (with respect to the classification of semantic categories),
but presumably it will be even possible to state this kind of
information essentially only in an appendix to the algorithm. Such an
appendix should consist of the specification that every word-end (this
concerns also complementary word-ends) whose rightmost symbols coincide
with the word-ending in question (because a word-end is usually longer
than, or identical to, the word-ending which is accounted for by it)
and which is assigned certain morphemic information (concerning usually
gender) corresponds to the semantic category in question; cf. all
word-ends whose three rightmost symbols are ací and which are assigned
the output information F 7 sg ~ 2 pl (such as lací, which is "hidden"
in the complementary word-ends) correspond to the semantic category of
nouns of action (in this case, ací is correlated to the normal form
with ace, which is the Czech equivalent of the English ation). Possible
exceptions to the semantic information concerning the word-ends which
account for the word-endings in question should be indicated directly
in the algorithm (e.g. by superscripts in the output information); for
example, the above-mentioned nominal word-ending ací (which
systematically clashes with the adjectival word-ending ací N ~ F ~ S 1,
4 sg ~ Ž 1 sg ~ F 2, 3, 6, 7 sg ~ N ~ Ž ~ F ~ S 1, 4 pl, and thus is
accounted for by about 30 pieces of output information) has about five
semantic exceptions to it (such as nadací (nadace = grant, support
- neither action nor result of action)), for which there should be
established special word-ends in the algorithm, with the indication, in
the output information, of their semantic exceptionality (with respect
to the other word-ends whose rightmost symbols are ací and which are
assigned the output information in question), i.e. of their
non-membership in the class of nouns of action.
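The proposed appendix can be sketched as a lookup which combines the rightmost symbols of a word-end with its output information and consults a list of exceptions first. The data structures and the function name below are hypothetical; the single appendix entry and the single exception are the ones mentioned in the text above.

```python
# Appendix entries: (word-ending, required output information, semantic category).
APPENDIX = [("ací", "F 7 sg ~ 2 pl", "noun of action")]

# Word-ends marked in the algorithm as semantic exceptions.
EXCEPTIONS = {"nadací"}


def semantic_category(word_end, output_info):
    """Return the semantic category of a word-end, or None if none applies."""
    bare = word_end.lstrip("/")  # ignore the full-word-form marker
    if bare in EXCEPTIONS:
        return None
    for ending, required_info, category in APPENDIX:
        if bare.endswith(ending) and output_info == required_info:
            return category
    return None


print(semantic_category("lací", "F 7 sg ~ 2 pl"))
print(semantic_category("nadací", "F 7 sg ~ 2 pl"))
```

The check on the output information is what keeps the adjectival readings of ací out of the class of nouns of action in this sketch.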
4. AMBIGUITY
This section brings information concerning (i) cases of morphemic
distinctions not included in the algorithm; (ii) genuine irresolvable
cases, and (iii) cases of morphemically irresolvable ambiguity.
(i) Cases of morphemic distinctions not included in the algorithm. We
prefer not to include in the algorithm of analysis (with possible
exceptions) morphemic distinctions concerning those word-inflection
endings which occur in technical texts only rarely or not at all,
particularly the following distinctions:
(a) Verbs: 1 sg indicative present (such as předpokládám (I suppose));
2 sg indicative present (such as předpokládáš (you suppose)); 2 sg
imperative (such as ... (choose)); transgressive forms (such as
předpokládaje, předpokládajíc, předpokládajíce (supposing)), and 1 and
2 pl imperative are assigned only the morphemic but not the
lemmatization information because these forms are supposed not to be
semantically relevant.
(b) Nouns: 5 sg and pl (such as odborníku! (expert!)).
(c) Adjectives: masculine-animate pl (such as vysocí (tall)).
(ii) Genuine irresolvable cases. By the present kind of analysis, there
practically cannot be resolved, in spite of their regular inflection,
geographical and personal proper names, their multitude preventing the
linguist from empirically establishing their (unambiguous or ambiguous)
word-ends. This can be partly overcome by introducing into the analysis
the recognition of capital letters and/or by establishing a "right set"
of proper names to be analyzed (which seems to be an easier task with
geographical names, cf. Evropa (Europe), Praha (Prague), etc.). On this
solution, for example, the accusative form of Praha (F), namely Prahu,
would yield a case of morphemically irresolvable ambiguity with the
locative form of práh (N; threshold), namely prahu. Also certain
frequent personal names can be treated in this way (cf. Schottkyho
dioda (the diode of Schottky)).
(iii) Cases of morphemically irresolvable ambiguity. The cases of this
kind of ambiguity concern all of the morphemic categories as well as
lemmatization, occurring singly or as combined in various ways. In what
follows, the relevant cases of ambiguity are indicated by ~, and the
other cases of ambiguity are indicated by commas or semicolons.
(a) Ambiguity concerning only part-of-speech category; cf. the
ambiguity of the word-ends corresponding to non-inflected words, such
as the ambiguity of the word-end tř between adverb and preposition
(B ~ C), tř standing for several words including e.g. vevnitř (inside)
or zevnitř (from inside).
(b) Ambiguity concerning part-of-speech category in combination with
other kinds of ambiguity; cf. the ambiguity of the word-ends
corresponding to inflected words, such as the ambiguity of the word-end
růst between noun and verb (N 1, 4 sg ~ Infinitive: growth ~ to grow),
or the ambiguity of the word-end /rovná between adjective and verb
(A F 1 sg; S 1, 4 pl ~ K: direct ~ straightens).
(c) Ambiguity concerning only gender, cf. the ambiguity in gender
concerning word-inflection endings with adjectives, such as the
ambiguity of the word-ends (coinciding, with one exception, with
word-inflection endings) ých (2, 6 pl) and ými (7 pl), which are
ambiguous among all genders (N ~ Ž ~ F ~ S).
(d) Ambiguity concerning gender in combination with other kinds of
ambiguity:
(aa) Ambiguity concerning gender in combination with case and number,
cf. the word-end /set, which is ambiguous between masculine-inanimate
and neuter noun (N 1, 4 sg ~ S 2 pl: set ~ of hundreds).
(bb) Surface-syntax ambiguity concerning gender in combination with
underlying ambiguity concerning case and number, cf. the word-end
/řádky (lines), which is ambiguous between masculine-inanimate and
feminine noun (N 1, 4, 7 pl ~ F 2 sg; 1, 4 pl). This ambiguity in
gender, however, is not present on the underlying level of Czech, where
only a single lexical item (masculine-inanimate noun) is hypothesized
to occur, as corresponding to the two surface normal forms (i.e.
masculine-inanimate and feminine), the two surface genders accidentally
yielding ambiguity in the word-end (word-form) /řádky.
(cc) Ambiguity concerning gender in combination with animateness (and
case), cf. the word-end /člen (member), which is ambiguous between
masculine-inanimate and masculine-animate noun (N 1, 4 sg ~ Ž 1 sg).
(In the majority of the other cases of the inflection of masculine
nouns, the ambiguity in animateness is not accompanied by the case
ambiguity.)
(e) Ambiguity concerning only case (and number), not accompanied by any
other kinds of ambiguity, cf. the word-end tody (F 2 sg ~ 1, 4 pl).
(f) Systematic ambiguity concerning the distinction between
geographical names and possessive adjectives derived from lexically
corresponding personal names, cf. the word-end /Benešova (N 2 sg ~
A N 2 sg; F 1 sg; S 1, 4 pl: of Benešov ~ of Beneš's).
(g) Ambiguity concerning lemmatization, cf. the word-end /vyváží (K),
corresponding to a single word-form vyváží, between lemmatization rules
[1; t] and [2; et], corresponding to the infinitives vyvážit (to
balance) and vyvážet (to export), respectively. Cf. also the
surface-syntax ambiguity in lemmatization with the word-end /řádky (cf.
(bb) above), which is surface-syntax ambiguous in gender (N: řádek ~
F: řádka).
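The lemmatization ambiguity of case (g) can be sketched by applying both rules to the same word-form and keeping every candidate. The word-form is read here as vyváží; diacritic marks are counted as separate symbols by means of Unicode NFD decomposition, which is an assumption of this sketch and not a description of the algorithm itself. The function name is hypothetical.

```python
import unicodedata


def candidate_lemmas(word_form, rules):
    """Apply every rule [erase; add] to the word-form and collect the results."""
    symbols = unicodedata.normalize("NFD", word_form)
    lemmas = []
    for erase, add in rules:
        kept = symbols[: len(symbols) - erase]
        lemmas.append(unicodedata.normalize("NFC", kept + add))
    return lemmas


# Rules [1; t] and [2; et] attached to the same word-end.
print(candidate_lemmas("vyváží", [(1, "t"), (2, "et")]))
```

Erasing one symbol removes only the length mark of the final í, while erasing two symbols removes the whole vowel, which is how the two rules can yield vyvážit and vyvážet from a single word-form; the choice between them is left to the later levels of analysis.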
The present treatment of ambiguity is characteristic of the procedural
conception of morphemics in that the method of accounting for every
etymological word-ending by means of at least one word-end (piece of
output information) removes from the analysis the systematic ambiguity
as well as morphemic irregularities (exceptions) concerning
etymological word-inflection and word-formation endings, which have
been usually treated by means of various restrictions and other ad-hoc
means. Every case of the systematic etymological ambiguity is
accountable for by several tens or even hundreds of pieces of output
information (cf. the systematic ambiguity of the word-formation ending
ací as mentioned in section 3, or that of the word-inflection ending
y among masculine-inanimate, masculine-animate and feminine nouns with
additional morphemically irresolvable ambiguity concerning case and
number: N 1, 4, 7 pl ~ Ž 4, 7 pl ~ F 2 sg; 1, 4 pl); on the other hand,
exceptions to word-endings (in the form of word-ends with different
output information) are accountable for by several pieces of output
information (cf. the word-inflection ending ý as mentioned in
section 2, which is accountable for by three pieces of output
information, representing one exception, or the word-formation ending
ení as mentioned in section 5, which is accountable for by five pieces
of output information, representing six exceptions).
After resolving the cases of the systematic etymological ambiguity and
of irregularity, it is possible to list the remaining (about one
hundred) cases of morphemically irresolvable ambiguity (with the
exception of the case-number ambiguity accompanying gender ambiguity);
such a list can be compared to the list by (Panevová, 1981) involving
ambiguous word-forms in Czech. Panevová's list, not being lexically
restricted with respect to specific applications, includes also proper
names, words not occurring in technical texts and forms not analyzed by
the present algorithm (such as singular imperative with verbs), but on
the other hand, it consists only of full word-forms, thus intersecting
with the present list, where first of all ambiguous word-ends in the
form of parts of words are involved.
5. QUANTITATIVE ASPECTS 
The present conception of the algorithm 
of morphemic analysis is based on the 
absolute frequency of word-ends in tech- 
nical texts. In the ideal case, the word- 
-ends should be arranged with respect to 
the frequency of their last (rightmost), 
last-but-one, etc., symbols - a task 
which itself would require the aid of 
a computer; for the time being, we must 
work with an approximation, which makes it necessary to divide the
algorithm into two parts according to the assumption that the first two
hundred word-ends on the scale of absolute frequency, arranged
according to a statistical examination concerning the whole word-ends,
could resolve about fifty percent of the words of a technical text,
while the other word-ends of the algorithm (pieces of output
information), arranged according to the frequency of their last
symbols, should resolve the remaining portion of a technical text. We
assume that out of the about twenty thousand pieces of output
information of the broadly conceived preliminary version of the
algorithm, only several thousands will be sufficient to cover the words
which may occur in a standard technical text (this will lead to
a substantial reduction of the preliminary version of the algorithm).
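The assumption about the first two hundred word-ends can be checked on a corpus by a simple count. The following sketch supposes that every running word of a text has already been replaced by the word-end which resolves it; the function name and the toy sample are hypothetical.

```python
from collections import Counter


def coverage(word_ends_in_text, k):
    """Share of running words resolved by the k most frequent word-ends."""
    if not word_ends_in_text:
        return 0.0
    counts = Counter(word_ends_in_text)
    covered = sum(n for _, n in counts.most_common(k))
    return covered / len(word_ends_in_text)


# Toy sample: one word-end per running word of an imaginary text.
sample = ["/a", "/v", "/a", "ých", "/se", "/a"]
print(coverage(sample, 1))
```

With a real technical text, coverage(text, 200) would give the figure which the paper estimates at about fifty percent.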
The words included into the analysis fall into four major semantic
hyper-categories (not used in the semantic analysis): (i) words with
the most general semantics (including the forms of categorial verbs,
such as být (to be), prepositions, such as v (in), etc.); (ii) general
terms typical of technical texts (such as metoda (method), ...
(system), etc.); (iii) words specific to the given technical domain,
e.g. microelectronics (such as katoda (cathode), obvod (circuit),
etc.), and (iv) words typical of other (possibly affiliated) domains
(such as ... (brick), střecha (roof), etc.).
The conception of the most frequent two hundred word-ends (which are
arranged in a special algorithm) can be illustrated by a list involving
ten most frequent word-ends; in Czech technical texts, they belong to
the first hyper-category. These word-ends are of three kinds:
(i) word-ends in the form of parts of word-forms (which may
accidentally coincide with etymological word-endings, such as ých or
ého); (ii) word-ends in the form of full word-forms (such as /se or
/je), and (iii) word-ends in the form of parts of word-forms resolvable
with minor exceptions (such as ... or ...; such word-ends are indicated
by ...). In addition to this, there can be distinguished morphemically
unambiguous word-ends (cf. /na, /a, /v, ...) vs morphemically ambiguous
word-ends (cf. ých, /se, ...) in the list in Figure 3; all cases of
ambiguity (including the ambiguity in case and number) are indicated by
~; with /je, for the sake of clarity, the morphemic information is
given directly by means of English equivalents.
Figure 3. Ten most frequent word-ends.
/se -- Zj (reflexive) ~ ...
/na -- C (on, for)
/a -- D (and)
/v -- C (in)
/je -- ...
... -- A N ~ Ž 1 ~ 4 sg ...
6. CONCLUSION
We have described a not yet implemented but promising system of
a right-to-left morphemic analysis intended for technical texts in
Czech and based on a conception of morphemically unambiguous or
irresolvably ambiguous word-ends as embodying the cases of morphemic
ambiguity in an inflectional language. The present system seems to be
more economic than the previous systems (which are fully or partly
based on the conception of etymological word-endings (and word-stems)
or on the conception of word-ends as consisting of a fixed, apriori
established number of symbols) in that it can dispense with any
dictionary as well as with the notion of morphemic irregularity;
moreover, it is capable of an interaction with the other levels of
analysis, as well as of various adjustments.
The advantages of the present system 
vis-a-vis the previous systems can be 
summarized as follows. 
(i) Due to the fact that every set of complementary word-ends (with
respect to the given horizontal word-end(s)) is assigned a common piece
of output information, and also to the fact that even a single word-end
often corresponds to several words (lexical units) and/or to several
word-forms, the number of the pieces of output information necessary
for resolving a standard technical text is presumably considerably
lower than the number of the word-forms (of both inflected and
uninflected words) occurring in such a text.
(ii) The present system is able to account for the word-forms of new
(newly coined) words with productive word-endings automatically,
without considering their stems.
(iii) The account of productive word-endings also makes it possible to
account for semantically relevant word-endings by indicating the
semantically relevant pieces of output information.
REFERENCES
1. Buráňová Eva. 1980. Ob odnoj vozmožnosti semantičeskoj
klassifikacii suščestvitel'nych (On one possibility of semantic
classification of nouns). Prague Bulletin of Mathematical
Linguistics 34, 33-44.
2. Hajičová Eva and Sgall Petr. 1981. Towards Automatic Understanding
of Technical Texts. Prague Bulletin of Mathematical Linguistics 36.
3. Kirschner Zdeněk. 1982. MOSAIC - A Method of Automatic Extraction
of Technical Terms in Texts. Prague Bulletin of Mathematical
Linguistics 37.
4. Kirschner Zdeněk. 1982. On a device in dictionary operation in
machine translation. COLING 82 - Proceedings of the Ninth
International Conference on Computational Linguistics. North
Holland - Academia.
5. Konečná D. and F~ronek J. 1960. Morfologická analýza podle
posledního písmene (Morphological analysis according to the last
letter). Acta Universitatis Carolinae: Slavica Pragensia 2. Praha.
6. Panevová Jarmila. 1981. Lexical Input Data for Experiments with
Czech. Explizite Beschreibung der Sprache und automatische
Textbearbeitung VI. Praha: Faculty of Mathematics and Physics.
7. Panevová Jarmila and Sgall Petr. 1979. Toward an Automatic Parser
for Czech. International Review of Slavic Linguistics.
8. Sgall Petr. 1960. Soustava pádových koncovek v češtině (The system
of case endings in Czech). Acta Universitatis Carolinae: Slavica
Pragensia 2.
9. Slavíčková Eva. 1975. Retrográdní morfematický slovník češtiny
(A retrograde morphematic dictionary of Czech). Praha: Academia.
10. Weisheitelová Jana. 1981. Automatic Analysis of Czech Morphemics.
Prague Studies in Mathematical Linguistics 7, 223-236.
11. Weisheitelová Jana, Králíková Květa and Sgall Petr. 1982. Morphemic
Analysis of Czech. Explizite Beschreibung der Sprache und
automatische Textbearbeitung VII. Praha: Faculty of Mathematics and
Physics.
