WERNER BRECHT 
MORPHOLOGICAL ANALYSIS (A FORMAL APPROACH) 
1. MOTIVATION AND BASIC IDEAS 
Since 1972 a research team in Bonn has been working on the
automatic syntactic analysis of the German language. The morphological
analysis is a part of this work that has already been formalized and
programmed by the author. The present paper can be regarded as a
generalization of this formalized description.
In this first chapter some expressions like "text" or "lexicon" are
taken to be intuitively clear; the exact definitions follow later.
The basis of every description of a text in any language is
a morphological analysis of this text. One can easily agree that such a
description has to be derived from the words or sentences of the text
which is to be described. The expression "description of a text" is
understood in a very general sense: one can imagine a syntactic
description, a semantic interpretation, a combination of both, or
any other information.
In a natural language the number of possible texts is not finite.
This is easy to see from the following sequence of sentences:
    One is a number
    Two is a number
    ...
Hence it is impossible in practice to use a lexicon of the following
form
    text 1    description of text 1
    text 2    description of text 2
    ...       ...
for all texts of a language. Only one possibility remains: one has to
assign one or more descriptions (more than one in the case of
homography) to the words and sequences of words of which every text
consists. Then one can try to derive the descriptions of the text from
the descriptions of the words or sequences of words.
2. BASIC DEFINITIONS 
2.1. Remark. 
Let $A$ and $B$ be sets. We use the following notation:
$P(A)$ : the set of all subsets of $A$ (the power set of $A$)
$A^*$ : the free monoid over $A$
$A \times B$ : the Cartesian product of $A$ and $B$
$|x|$ : the "length" of $x \in A^*$ (the number of elements of $A$
of which $x$ consists); $|x| \in \mathbb{N}$.
We now define the expressions "character" and "string" for
a language. We use five basic sets:
$\mathrm{LETT} := \{\mathrm{A}, \mathrm{B}, \mathrm{C}, \ldots, \mathrm{Z}\}$ the set of "letters"
$\mathrm{DIG} := \{0, 1, 2, \ldots, 9\}$ the set of "digits"
$\mathrm{BLANK} := \{\sqcup\} =: \{\text{blank}\}$, $|\sqcup| = 1$, $\sqcup \in \mathrm{BLANK}^*$
$\mathrm{PS} := \{.\} \cup \{,\} \cup \{!\} \cup \ldots$ the set of "punctuation-signs"
$\mathrm{SS} := \{\%, \&, \sim, \ldots\}$ the set of "special-signs"
From these five sets we derive:
a) $\mathrm{CHAR} := \mathrm{LETT} \cup \mathrm{DIG} \cup \mathrm{BLANK} \cup \mathrm{PS} \cup \mathrm{SS}$, the set of
"characters".
If $x \in \mathrm{CHAR}$, we say: "x is a character".
b) $\mathrm{CHAR}^*$: the free monoid over CHAR.
If $x \in \mathrm{CHAR}^*$, we say: "x is a string"
or: "x is a sequence of characters"
or: "x is a text".
c) $\mathrm{LISS} := \mathrm{LETT} \cup \mathrm{DIG} \cup \mathrm{SS}$, the set of "characters without
blank and punctuation-signs".
d) $\mathrm{LISS}^*$: the free monoid over LISS.
If $x \in \mathrm{LISS}^*$, we say: "x is a string without blank and
punctuation-signs".
2.2. Remark. 
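$[\mathrm{LISS} \subset \mathrm{CHAR} \Rightarrow \mathrm{LISS}^* \subset \mathrm{CHAR}^*] \Rightarrow [e \in \mathrm{LISS}^*,\ e \text{ empty element} \Rightarrow e \in \mathrm{CHAR}^*,\ e \text{ empty element}]$
We now define the expression "word".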
2.3. Definition. 
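$\mathrm{WORD1} := \{x \mid x \in \mathrm{BLANK}^* \wedge |x| > 0\}$
$\mathrm{WORD2} := \{x \mid x \in \mathrm{LISS}^* \wedge |x| > 0\}$
$\mathrm{WORD} := \mathrm{WORD1} \cup \mathrm{WORD2} \cup \mathrm{PS}$
If $x \in \mathrm{WORD}$, we say: "x is a word".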
2.4. Examples. 
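$\sqcup\sqcup\sqcup \in \mathrm{WORD}$, because $\sqcup\sqcup\sqcup \in \mathrm{WORD1}$, $|\sqcup\sqcup\sqcup| = 3$
$\text{WHEN} \in \mathrm{WORD}$, because $\text{WHEN} \in \mathrm{WORD2}$, $|\text{WHEN}| = 4$
$! \in \mathrm{WORD}$, because $! \in \mathrm{PS}$ ($|!| = 1$, '!'
regarded as an element of $\mathrm{PS}^*$)
But
$\text{WHEN}\sqcup \notin \mathrm{WORD}$
$\text{STOP!} \notin \mathrm{WORD}$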
$?\,?\,! \notin \mathrm{WORD}$.
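
Definition 2.3 can be read as a membership test. The following is a minimal sketch under stated assumptions: the paper leaves PS and SS open-ended, so the concrete contents of `PS` and `SS` below are illustrative samples only, the blank is represented by an ordinary space, and the function name `is_word` is hypothetical.

```python
import string

# Assumed concrete alphabets. LETT and DIG follow the paper; PS and SS are
# only small illustrative samples, because the paper leaves them open ("...").
LETT = set(string.ascii_uppercase)
DIG = set(string.digits)
BLANK = {" "}
PS = set(".,!?;:")
SS = set("%&~+-=/'")
LISS = LETT | DIG | SS
CHAR = LISS | BLANK | PS


def is_word(x: str) -> bool:
    """Return True if x lies in WORD = WORD1 ∪ WORD2 ∪ PS (Definition 2.3)."""
    if len(x) == 0:
        return False                         # the empty string is never a word
    in_word1 = all(c in BLANK for c in x)    # non-empty run of blanks
    in_word2 = all(c in LISS for c in x)     # non-empty string over LISS
    in_ps = len(x) == 1 and x in PS          # exactly one punctuation sign
    return in_word1 or in_word2 or in_ps
```

With these sample alphabets, `is_word("WHEN")` and `is_word("!")` hold, while `is_word("WHEN ")` and `is_word("STOP!")` do not, in line with Examples 2.4.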
3. THE INPUT FOR THE MORPHOLOGICAL ANALYSIS 
Our analysis will accept every $x \in \mathrm{CHAR}^*$.
3.1. Remark.
$x \in \mathrm{CHAR}^*$ with $|x| = 0$ ($x = e$) is a trivial case because there is
nothing to analyse.
Let $x$ be a text, $x \in \mathrm{CHAR}^*$. If we want to analyse $x$, we say: "x
is the input for the analysis" or for short: "x is the input".
3.2. Examples. 
a) $x = \sqcup\sqcup\texttt{WE'LL}\sqcup\texttt{GO}\sqcup\sqcup\sqcup\texttt{ON!}\sqcup\sqcup$
b) $x = \texttt{17.23}\sqcup\texttt{+}\sqcup\texttt{11.00}\sqcup\texttt{=}\sqcup\texttt{28.32}\sqcup\texttt{-}\sqcup\texttt{0.9}$
c) $x = \texttt{\%\%\%\%\%\%\%\%\%\%\%\%\%\%\%,,////////////''''''ABCDEF}\sqcup\sqcup\texttt{???.}$
4. THE SEGMENTATION OF THE INPUT
We want to divide a text into a well-defined sequence of words
and then remove the blanks.
4.1. Definition. 
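Let segm1 be a mapping between $\mathrm{CHAR}^*$ and $(P(\mathrm{WORD}))^*$
$\mathrm{segm1} : \mathrm{CHAR}^* \longrightarrow (P(\mathrm{WORD}))^*$
$x \longmapsto y$
such that
1) $e \longmapsto e_p$ ($e_p$: empty element in $(P(\mathrm{WORD}))^*$)
2) $|x| > 0 \wedge x = a_1 a_2 a_3 \ldots a_k \wedge a_i \in \mathrm{CHAR}$ $(i = 1, 2, \ldots, k)$
$\Rightarrow y = \{w_1\}\{w_2\}\{w_3\}\ldots\{w_m\}$, $w_i \in \mathrm{WORD}$,
$w_1 w_2 w_3 \ldots w_m = x \wedge |w_i|$ maximal $(i = 1, 2, \ldots, m)$.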
4.2. Example. 
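Let $x$ be: $x := \text{GOOD}\sqcup\sqcup\sqcup\text{DAY!}$
Then
a) $y' := \{\text{GOOD}\}\{\sqcup\sqcup\}\{\sqcup\}\{\text{DAY}\}\{!\} \neq \mathrm{segm1}(x)$ because
$|w_2|$ is not maximal.
b) $y'' := \{\text{GOOD}\}\{\sqcup\sqcup\}\{\text{DAY}\}\{!\} \neq \mathrm{segm1}(x)$ because
$\text{GOOD}\sqcup\sqcup\text{DAY!} \neq x$
c) But $y = \{\text{GOOD}\}\{\sqcup\sqcup\sqcup\}\{\text{DAY}\}\{!\}$ fits:
$y = \mathrm{segm1}(x)$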
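
The maximality condition of Definition 4.1 amounts to a greedy scan that keeps extending a run of blanks, or a run of LISS characters, as long as possible. The sketch below is one possible reading, not the author's program. It assumes that every character other than a blank or a sample punctuation sign belongs to LISS, it represents each singleton set $\{w_i\}$ simply by the string $w_i$, and the names `PUNCT`, `char_class` and `segm1` are hypothetical.

```python
PUNCT = set(".,!?;:")  # assumption: a small sample of the paper's set PS


def char_class(c: str) -> str:
    """Classify one character as blank, punctuation sign, or LISS character."""
    if c == " ":
        return "blank"
    if c in PUNCT:
        return "ps"
    return "liss"  # assumption: everything else is a letter, digit or special sign


def segm1(x: str) -> list[str]:
    """Split x into maximal words; concatenating the result gives x back."""
    words: list[str] = []
    for c in x:
        cls = char_class(c)
        # Extend the current word only for runs of blanks or runs of LISS
        # characters; a punctuation sign is always a word of length 1.
        if words and cls != "ps" and char_class(words[-1][0]) == cls:
            words[-1] += c
        else:
            words.append(c)
    return words


print(segm1("GOOD   DAY!"))  # ['GOOD', '   ', 'DAY', '!']
```

The three blanks stay together as one word, as in case c) of Example 4.2, and no character of the input is lost.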
4.3. Remark. 
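a) segm1 is ONE-TO-ONE
Proof.
Let $y \in \mathrm{segm1}(\mathrm{CHAR}^*) \wedge y' \in \mathrm{segm1}(\mathrm{CHAR}^*) \wedge y = y'$.
$y = \{w_1\}\{w_2\}\ldots\{w_m\} \Rightarrow x = w_1 w_2 \ldots w_m$
$y' = \{w'_1\}\{w'_2\}\ldots\{w'_m\} \Rightarrow x' = w'_1 w'_2 \ldots w'_m$
$y = y' \Rightarrow w_1 w_2 \ldots w_m = w'_1 w'_2 \ldots w'_m \Rightarrow x = x'$
b) segm1 is ONTO
Proof.
$y \in (P(\mathrm{WORD}))^* \Rightarrow$
1) $y = e_p \Rightarrow \exists\, e \in \mathrm{CHAR}^*: \mathrm{segm1}(e) = e_p$
2) $y \neq e_p \Rightarrow y = \{w_1\}\{w_2\}\ldots\{w_m\}$
Let $x$ be: $x := w_1 w_2 \ldots w_m \in \mathrm{CHAR}^*$. Then $\mathrm{segm1}(x) = y$.
c) Hence segm1 is a bijection.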
4.4. Definition. 
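Let segm2 be a mapping between $(P(\mathrm{WORD}))^*$ and $(P(\mathrm{WORD}))^*$
$\mathrm{segm2} : (P(\mathrm{WORD}))^* \longrightarrow (P(\mathrm{WORD}))^*$
$y \longmapsto z$
such that
1) $e_p \longmapsto e_p$
2) $y \neq e_p \wedge y = \{w_1\}\{w_2\}\ldots\{w_m\} \wedge w_i \in \mathrm{WORD}$ $(i = 1, 2, \ldots, m)$
$\Rightarrow z = \{w_{k_1}\}\{w_{k_2}\}\ldots\{w_{k_n}\} \wedge w_{k_i} \notin \mathrm{WORD1} \wedge$
$k_i \in \{1, 2, \ldots, m\}$ $(i = 1, 2, \ldots, n) \wedge 1 \leq k_1 < k_2 < \ldots < k_n \leq m$
We change our notation
$u_i := w_{k_i}$ $(i = 1, 2, \ldots, n)$
and get
$z = \{u_1\}\{u_2\}\ldots\{u_n\}$.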
4.5. Example. 
$y := \{\text{GOOD}\}\{\sqcup\sqcup\sqcup\}\{\text{DAY}\}\{!\}$
$z = \{\text{GOOD}\}\{\text{DAY}\}\{!\}$
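
In code, segm2 is a filter that keeps the order of the words and drops every word consisting only of blanks. The following sketch assumes that a segmented text is held as a list of strings, one string per singleton set; the function name `segm2` follows the paper, the representation is an assumption.

```python
def segm2(y: list[str]) -> list[str]:
    """Drop all words of WORD1 (runs of blanks) and keep the rest in order."""
    return [w for w in y if w.strip(" ") != ""]


print(segm2(["GOOD", "   ", "DAY", "!"]))  # ['GOOD', 'DAY', '!']
```

Because the blank words are discarded, inputs that differ only in the length of their blank runs give the same result, so the mapping cannot be inverted.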
4.6. Remark. 
a) segm2 is not ONE-TO-ONE (it is MANY-TO-ONE)
Proof.
$\mathrm{segm2}(\{w\}\{\sqcup\sqcup\}) = \mathrm{segm2}(\{w\}\{\sqcup\})$
b) segm2 is not ONTO
Proof.
Because of $w_{k_i} \notin \mathrm{WORD1}$ there exists no $y \in (P(\mathrm{WORD}))^*$ such
that $\mathrm{segm2}(y) = \{\sqcup\}$
4.7. Definition. 
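A segmentation of a text is a map segm between $\mathrm{CHAR}^*$ and
$(P(\mathrm{WORD}))^*$ such that
$\mathrm{segm} := \mathrm{segm2} \circ \mathrm{segm1}$
The following diagram is commutative:
$$\mathrm{CHAR}^* \xrightarrow{\ \mathrm{segm1}\ } (P(\mathrm{WORD}))^* \xrightarrow{\ \mathrm{segm2}\ } (P(\mathrm{WORD}))^*, \qquad \mathrm{segm} : \mathrm{CHAR}^* \longrightarrow (P(\mathrm{WORD}))^*$$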
4.8. Remark. 
segm is neither ONE-TO-ONE nor ONTO
Proof.
segm2 is neither ONE-TO-ONE nor ONTO.
We call every $z \in \mathrm{segm}(\mathrm{CHAR}^*)$ "a segmented text" or "a segmented
input".
4.9. Example for a segmentation. 
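$x = \sqcup\text{WHAT}\sqcup\text{ARE}\sqcup\sqcup\text{YOU}\sqcup\text{THINKING}\sqcup\text{OF}\sqcup\text{?}$
$\downarrow$ segm1
$y = \{\sqcup\}\{\text{WHAT}\}\{\sqcup\}\{\text{ARE}\}\{\sqcup\sqcup\}\{\text{YOU}\}\{\sqcup\}\{\text{THINKING}\}\{\sqcup\}\{\text{OF}\}\{\sqcup\}\{?\}$
$\downarrow$ segm2
$z = \{\text{WHAT}\}\{\text{ARE}\}\{\text{YOU}\}\{\text{THINKING}\}\{\text{OF}\}\{?\}$
so that $z = \mathrm{segm}(x)$.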
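
When only the composite segm is needed, the two steps can be collapsed into a single pass. The sketch below uses a regular expression instead of the two-step definition; this is a substituted technique, not the construction given in the paper, and it yields the same result only under the assumption that PS is the sample set `.,!?;:` and that every other non-blank character belongs to LISS.

```python
import re

# One punctuation sign, or a maximal run of characters that are neither
# a blank nor a punctuation sign. Blanks match neither branch and are skipped.
TOKEN = re.compile(r"[.,!?;:]|[^ .,!?;:]+")


def segm(x: str) -> list[str]:
    """Return the segmented text of x as a list of words without blanks."""
    return TOKEN.findall(x)


print(segm(" WHAT ARE  YOU THINKING OF ?"))
# ['WHAT', 'ARE', 'YOU', 'THINKING', 'OF', '?']
```

Texts that differ only in their blanks, such as `"GOOD   DAY!"` and `"GOOD DAY!"`, are mapped to the same segmented text.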
5. REMARKS ON THE CONCEPT "LEXICON"
By using the expression "lexicon", all actions of identifying and
describing words, sentences and texts can be concentrated in a single
concept. In a formal sense any lexicon is a set of "items".
Definition. 
$\mathrm{LEX} := \{(w, B)_\lambda\}$, $\lambda \in \Lambda$ ($\Lambda$: any index set)
The pair $(w, B)_\lambda$ is called an "item" of the lexicon.
For every item the following holds:
a) $w = \{w_1\}\{w_2\}\ldots\{w_m\}$, $m \geq 1$, $w_i \in \mathrm{WORD2} \cup \mathrm{PS}$ $(i = 1, 2, \ldots, m)$
b) $w_i$ $(i = 1, 2, \ldots, m)$ is fixed as a word or punctuation-sign of
a language.
c) $B$ is any description of $w$.
Let $\mathfrak{B}$ be the set of all intended descriptions of all sequences of
words and punctuation-signs of some language. Then
$\mathrm{LEX} \subset (P(\mathrm{WORD}))^* \times \mathfrak{B}$
such that
$(w, B) \in \mathrm{LEX} :\Leftrightarrow a) \wedge b) \wedge c)$
LEX is a relation between $(P(\mathrm{WORD}))^*$ and $\mathfrak{B}$. In general a
sequence $\{w_1\}\{w_2\}\ldots\{w_m\}$ has more than one description (by homography
for example). Hence LEX cannot be a map between $(P(\mathrm{WORD}))^*$ and
$\mathfrak{B}$.
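
Since LEX is a relation and not a map, a natural data structure stores a set of descriptions per word sequence. The sketch below is illustrative only: the class name `Lexicon`, its methods, and the two sample descriptions of the German homograph "SEIN" are assumptions introduced here and do not come from the Bonn lexicon.

```python
from collections import defaultdict


class Lexicon:
    """A relation between word sequences w and descriptions B."""

    def __init__(self) -> None:
        self._items: dict[tuple[str, ...], set[str]] = defaultdict(set)

    def add(self, w: tuple[str, ...], description: str) -> None:
        """Add the item (w, B); the same w may carry several descriptions."""
        self._items[tuple(w)].add(description)

    def descriptions(self, w: tuple[str, ...]) -> set[str]:
        """Return all B with (w, B) in the lexicon (empty set if none)."""
        return set(self._items.get(tuple(w), set()))


lex = Lexicon()
lex.add(("SEIN",), "verb (infinitive)")      # hypothetical description
lex.add(("SEIN",), "possessive pronoun")     # hypothetical description
print(sorted(lex.descriptions(("SEIN",))))
# ['possessive pronoun', 'verb (infinitive)']
```

A plain dictionary from `w` to a single `B` would lose one of the two descriptions, which is exactly why LEX cannot be a map.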
There are two ways to define a lexicon.
a) The extensional definition. All the elements of the lexicon are
listed. In this case we often call such a lexicon a "list".
Examples.
1) $A := \{1, 2, 3\}$
$x$ is an element of $A$ $:\Leftrightarrow x = 1 \vee x = 2 \vee x = 3$
2) $w$ is a noun $:\Leftrightarrow$ $w$ is an element of a list of nouns.
b) The intensional definition. All the elements of the lexicon are fixed
by some common properties.
Examples.
1) $A := \{x \mid x \in \mathbb{N} \wedge 0 < x < 4\}$
$x$ is an element of $A$ $:\Leftrightarrow x \in \mathbb{N} \wedge 0 < x < 4$
2) $w$ is a noun $:\Leftrightarrow$ $w$ has some characteristics like prefix, suffix and
so on.
3) $w$ is a verb $:\Leftrightarrow$ a part of $w$ (the stem) is an element of a list, and
suffix and prefix have some characteristics.
In both cases (extensional or intensional) the lexicon has the abstract
form:
    w    description of w        or        (w, B)
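
The difference between the two kinds of definition can be made concrete: an extensional definition is a stored collection, an intensional one is a predicate. In the sketch below the noun list and the suffix rule are hypothetical illustrations; the paper does not state which characteristics its intensional noun definition uses.

```python
# Extensional: the set A is given by listing its elements.
A_EXTENSIONAL = {1, 2, 3}


def in_a_extensional(x: int) -> bool:
    return x in A_EXTENSIONAL


# Intensional: the same set is given by a common property of its elements.
def in_a_intensional(x: int) -> bool:
    return isinstance(x, int) and 0 < x < 4


# Extensional noun definition: membership in a list (hypothetical entries).
NOUN_LIST = {"HUND", "TAG"}


def is_noun_extensional(w: str) -> bool:
    return w in NOUN_LIST


# Intensional noun definition: a characteristic suffix (hypothetical rule).
NOUN_SUFFIXES = ("UNG", "HEIT", "KEIT")


def is_noun_intensional(w: str) -> bool:
    return w.endswith(NOUN_SUFFIXES)


print(is_noun_extensional("HUND"), is_noun_intensional("ZEITUNG"))  # True True
```

The extensional test accepts only what has been listed, while the intensional test also accepts words that were never entered, at the price of depending on the quality of the rule.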
6. THE MORPHOLOGICAL ANALYSING STEP
Let $x \in \mathrm{segm}(\mathrm{CHAR}^*)$
Case 1.
$x = e_p$. There is nothing to analyse.
Case 2.
$\exists\, n \in \mathbb{N}$ such that $x = \{u_1\}\{u_2\}\ldots\{u_n\} \wedge$
$u_i \in \mathrm{WORD2} \cup \mathrm{PS}$ $(i = 1, 2, \ldots, n)$.
Let $\{t_1, t_2, \ldots, t_p\} \subseteq \{1, 2, \ldots, n\}$; $p \geq 1$
Let $k$ be a map $k : \{t_1, t_2, \ldots, t_p\} \longrightarrow \{t_1, t_2, \ldots, t_p\}$
such that $k$ is ONE-TO-ONE and ONTO.
Then $(k(t_1), k(t_2), \ldots, k(t_p))$ is a permutation of $(t_1, t_2, \ldots, t_p)$.
We call
$\{u_{k(t_1)}\}\{u_{k(t_2)}\}\ldots\{u_{k(t_p)}\}$ a "subsequence" of $\{u_1\}\ldots\{u_n\}$
Example.
Let $x = \{u_1\}\{u_2\}\{u_3\}$
Then
a) $\{u_2\}$; $(p = 1)$
b) $\{u_3\}\{u_1\}$; $(p = 2)$
c) $\{u_3\}\{u_1\}\{u_2\}$; $(p = 3)$
d) $\{u_1\}\{u_2\}\{u_3\}$; $(p = 3)$
are subsequences of $\{u_1\}\{u_2\}\{u_3\}$
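
Read literally, a subsequence in the above sense is any ordering of any non-empty selection of positions. The set of all of them can therefore be enumerated by combining index subsets with their permutations. The sketch below assumes a segmented text held as a list of strings; the function name `subsequences` is hypothetical.

```python
from itertools import combinations, permutations


def subsequences(x: list[str]) -> list[tuple[str, ...]]:
    """Enumerate every ordering of every non-empty selection of positions."""
    n = len(x)
    result = []
    for p in range(1, n + 1):                       # size of the selection
        for subset in combinations(range(n), p):    # the indices t_1 .. t_p
            for order in permutations(subset):      # the bijection k
                result.append(tuple(x[i] for i in order))
    return result


T = subsequences(["u1", "u2", "u3"])
print(len(T))  # 15
```

For n = 3 there are 3 + 6 + 6 = 15 subsequences. The number grows very quickly with n, which suggests why a practical analysis restricts itself to a small subset of them.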
Definition. 
Let $T$ be the set of all subsequences (derived in the manner shown
above) of a given $x = \{u_1\}\{u_2\}\ldots\{u_n\}$
Let $t \in T$ be such a subsequence.
Then
a "morphological analysing step" related to the subsequence $t$ (for
short: $\mathrm{mas}_t$) is a relation between $\{t\}$ and LEX.
$\mathrm{mas}_t \subset \{t\} \times \mathrm{LEX}$
such that
$(t, (w, B)) \in \mathrm{mas}_t :\Leftrightarrow t = w$
Case 1. $\mathrm{mas}_t \neq \emptyset$
Then we say: we have identified the subsequence $t$ in our lexicon
and all related $B$'s are descriptions of $t$.
Case 2. $\mathrm{mas}_t = \emptyset$
Then we say: our lexicon does not (yet) contain the subsequence $t$.
We are not able to give any description of $t$.
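
A morphological analysing step is thus a lookup of t among the first components of the lexicon items. The following sketch assumes that LEX is held as a set of pairs `(w, B)` with `w` a tuple of strings; the sample items and their descriptions are hypothetical.

```python
# Hypothetical sample lexicon: a set of items (w, B).
LEX = {
    (("LAUF",), "verb stem"),
    (("ZU",), "preposition"),
    (("ZU",), "verb prefix"),
    (("ZU", "LAUF"), "verb stem with prefix"),
}


def mas(t, lex):
    """Return mas_t: all pairs (t, (w, B)) with (w, B) in lex and w == t."""
    t = tuple(t)
    return {(t, (w, b)) for (w, b) in lex if w == t}


print(len(mas(("ZU",), LEX)))   # 2  (Case 1: t identified, two descriptions)
print(mas(("HUND",), LEX))      # set()  (Case 2: t is not in the lexicon)
```

An empty result corresponds to Case 2, a non-empty result to Case 1, with one pair per description of t.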
7. THE MORPHOLOGICAL ANALYSIS 
The concept of the "morphological analysing step" is related to
one and only one subsequence $t \in T$.
The concept of the "morphological analysis", however, is more
general.
Let $x \in \mathrm{segm}(\mathrm{CHAR}^*)$, $x \neq e_p$, $x = \{u_1\}\{u_2\}\ldots\{u_n\}$
Let $T'$ be a subset of $T$ ($T' \subset T$) such that for every $\{u_i\}$ $(i = 1,
2, \ldots, n)$ there exists at least one $t \in T'$ which contains $\{u_i\}$.
7.1. Definition.
A "morphological analysis related to $T'$" (for short: $\mathrm{ma}_{T'}$) of
$\{u_1\}\{u_2\}\ldots\{u_n\}$ is the set of all $\mathrm{mas}_t$ such that $t \in T'$.
$\mathrm{ma}_{T'} := \{\mathrm{mas}_t \mid t \in T'\}$
Remark.
Let $i \in \{1, 2, \ldots, n\}$
Let $t_i$ be a subsequence of $\{u_1\}\{u_2\}\ldots\{u_n\}$ containing $\{u_i\}$.
Let $T_i$ be the set of all $t_i$.
We say that our analysis has failed if there exists one $\{u_i\}$ such that
$\mathrm{mas}_{t_i} = \emptyset$ for all $t_i \in T_i$
Otherwise we say that our analysis has been successful.
In general there is more than one $T'$ such that $\mathrm{ma}_{T'}$ is successful.
It has to be left to the user to fix the sets $T'$ according to his
particular intentions and possibilities.
7.2. Definition. 
Let $AT$ be the set of all $T'$ ($T'$ defined as above).
A "morphological analysis" (for short: ma) is the set of all $\mathrm{mas}_t$
such that there exists a $T' \in AT$ with $t \in T'$
$\mathrm{ma} := \{\mathrm{mas}_t \mid \exists\, T' \in AT \wedge t \in T'\}$
Remark.
Let $z \in \mathrm{CHAR}^*$ be a text such that there exists an $x = \mathrm{segm}(z)$ with
$x \neq e_p$. Then in practice we say: ma is a morphological analysis of the
text $z$.
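
The analysis related to a chosen T' and the success criterion of the preceding remark can be sketched as follows. Several simplifications are assumed: LEX is a set of pairs `(w, B)`, $\mathrm{ma}_{T'}$ is stored as a dictionary from each t to the items found for it, and "t contains $\{u_i\}$" is checked by value rather than by position. All names and sample items are hypothetical.

```python
def ma(t_prime, lex):
    """Return ma_T' as a mapping t -> set of lexicon items (w, B) with w == t."""
    return {tuple(t): {(w, b) for (w, b) in lex if w == tuple(t)} for t in t_prime}


def successful(x, analysis):
    """True if every word of x lies in at least one t with a non-empty mas_t."""
    return all(
        any(u in t and len(items) > 0 for t, items in analysis.items())
        for u in x
    )


x = ("EIN", "HUND")
t_prime = [("EIN",), ("HUND",)]
lex_small = {(("EIN",), "article"), (("EIN",), "numeral")}
print(successful(x, ma(t_prime, lex_small)))                              # False
print(successful(x, ma(t_prime, lex_small | {(("HUND",), "noun")})))      # True
```

With the smaller lexicon the analysis fails because no description is found for "HUND"; adding one item makes the same T' successful.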
8. EXAMPLE FOR A PRACTICAL MORPHOLOGICAL ANALYSIS 
This example shows the practice of a morphological analysis of a
German text. It has been programmed in Bonn as the basis
of the syntax-analysis mentioned above.
Let $x \in \mathrm{segm}(\mathrm{CHAR}^*)$, $x \neq e_p$, $x = \{u_1\}\{u_2\}\ldots\{u_n\}$
Let $T'$ be the following set:
$T' := \{t_i \mid t_i := \{u_i\},\ i = 1, 2, \ldots, n\} \subset T$
$T' = \{t_1, t_2, \ldots, t_n\}$
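Then the following holds:
$\mathrm{mas}_{t_i} \subset \{\{u_i\}\} \times \mathrm{LEX}$ $(i = 1, 2, \ldots, n)$
such that
$(\{u_i\}, (w, B)) \in \mathrm{mas}_{t_i} \Leftrightarrow w = \{u_i\}$
This simple case of a morphological analysis we call a
"word-by-word-analysis" of a given text.
We get
$\mathrm{ma}_{T'} = \{\mathrm{mas}_{t_1}, \mathrm{mas}_{t_2}, \ldots, \mathrm{mas}_{t_n}\}$
Each $\mathrm{mas}_{t_i}$ $(i = 1, 2, \ldots, n)$ is a set too.
Hence we have to write:
$$\begin{aligned}
\mathrm{ma}_{T'} = \{\ &\{(\{u_1\}, (w_1, B_{11})), \ldots, (\{u_1\}, (w_1, B_{1L_1}))\},\\
&\{(\{u_2\}, (w_2, B_{21})), \ldots, (\{u_2\}, (w_2, B_{2L_2}))\},\\
&\ldots\\
&\{(\{u_n\}, (w_n, B_{n1})), \ldots, (\{u_n\}, (w_n, B_{nL_n}))\}\ \}
\end{aligned}$$
For short we can write:
$\mathrm{ma}_{T'} : [\{u_i\} \leftrightarrow ((w_i, B_{i1}), \ldots, (w_i, B_{iL_i}))\ (i = 1, 2, \ldots, n)]$
Or:
$\mathrm{ma}_{T'} : [\{u_i\} \leftrightarrow (B_{i1}, B_{i2}, \ldots, B_{iL_i})\ (i = 1, 2, \ldots, n)]$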
Now one can see that the result of a word-by-word-analysis can
easily be represented with the following matrix-concept:
$$\begin{pmatrix}
u_1, & B_{11}, & B_{12}, & \ldots & B_{1L_1}\\
u_2, & B_{21}, & B_{22}, & \ldots & B_{2L_2}\\
\ldots\\
u_n, & B_{n1}, & B_{n2}, & \ldots & B_{nL_n}
\end{pmatrix}$$
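
The matrix-concept can be sketched as a list of rows, one row per word, where each row holds the word followed by all of its descriptions. The rows generally have different lengths, since $L_i$ depends on the word. The sketch assumes LEX as a set of pairs `(w, B)`; the sample items are hypothetical, and the descriptions are sorted only to make the output reproducible.

```python
def word_by_word(x, lex):
    """Return one row [u_i, B_i1, ..., B_iLi] per word u_i of the segmented text."""
    return [[u] + sorted(b for (w, b) in lex if w == (u,)) for u in x]


lex_sample = {
    (("EIN",), "article"),
    (("EIN",), "numeral"),
    (("HUND",), "noun"),
}
for row in word_by_word(["EIN", "HUND"], lex_sample):
    print(row)
# ['EIN', 'article', 'numeral']
# ['HUND', 'noun']
```

A row that contains only the word itself marks a word for which the lexicon has no description.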
In our syntax-analysis in Bonn a great deal of the morphological
analysis is done by word-by-word-analysis. We are successful in
describing articles, nouns, adverbs, adjectives and so on, but we have
some trouble with our verbs.
In the German language the prefix of some verbs may be found
far away from the stem of the verb.
Example. 
The verbs zulaufen and laufen are two quite different verbs.
Consider the following three German sentences:
1) Ein Hund ist mir zugelaufen.
2) Lauf mir nur nicht zu.
3) Zu ist er mir gelaufen.
A word-by-word-analysis will succeed only with sentence 1). In
2) and 3) we will find the verb laufen instead of zulaufen. That means
that we get a wrong description of our verb and a wrong description
of zu, which also exists in the German language without any relation
to a verb. Therefore a word-by-word-analysis alone cannot analyse
our verbs.
In our analysis we distinguish between two parts of the lexicon.
The first one allows word-by-word-analysis. It is intensionally
defined for proper names, nouns and adjectives, and extensionally
defined for all other words except verbs. The extensionally defined
part of this lexicon currently consists of nearly 2000 items.
The second one is our verb-lexicon, which is intensionally defined.
There exists an extensionally defined verb-stem-lexicon which at
present contains the stems, with their prefixes, of nearly 400 German
verbs. This stem-lexicon is growing quickly and is coded in
the following manner:
[{stem} [description of "stem"]]
[{prefix} {stem} [description of "prefix stem"]]
[{lauf} [description of "lauf"]]
[{zu} {lauf} [description of "zulauf"]]
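
One way to hold items coded in this manner is a mapping whose keys are one-word or two-word sequences. The sketch below is an assumption about a possible representation, not the format of the Bonn program; the description strings are placeholders, and the helper `prefixes_of` is hypothetical.

```python
# Keys are (stem,) or (prefix, stem); values are lists of descriptions.
STEM_LEXICON = {
    ("LAUF",): ['description of "lauf"'],
    ("ZU", "LAUF"): ['description of "zulauf"'],
}


def prefixes_of(stem: str) -> list[str]:
    """Return all prefixes recorded together with the given stem."""
    return sorted(
        key[0] for key in STEM_LEXICON if len(key) == 2 and key[1] == stem
    )


print(prefixes_of("LAUF"))  # ['ZU']
```

Looking up the prefixes of a recognized stem gives the candidate words to search for in the rest of the sentence.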
We start the morphological analysis with a word-by-word-analysis.
If this analysis is successful, we have obtained at least one
description for each word of the text. Some of these descriptions may
be wrong, because of homography and because of the verbs. We cannot
solve the homography problem in this early part of the analysis.
If we have identified a word as a verb, we check whether the
same sentence contains a word which can be a prefix of this verb.
If we find a possible prefix, the verb gets the descriptions resulting
from the prefix as well as the descriptions without this prefix. Working
in this way we get a lot of information for the words of our text. Some
of the information is wrong, but we can be sure that the right
information is among the descriptions. It is left to the syntax (or
perhaps to the semantics) to isolate the right descriptions.
Formally we can describe the verb-analysis as a set of $\mathrm{mas}_t$ such
that $t := \{u'\}\{u''\}$ where $\{u''\}$ has been recognized as a verb and $\{u'\}$
can be every word (other than $\{u''\}$) of the same sentence in which $\{u''\}$
occurs.
Given some text $\{u_1\}\{u_2\}\ldots\{u_n\}$.
Given a word-by-word-analysis which shows that $\{u_i\}$ may be a
verb.
$T' := \{t \mid t = \{u'\}\{u_i\} \wedge u' \in \{u_1, \ldots, u_{i-1}, u_{i+1}, \ldots, u_n\}\}$
Then the following holds:
$\mathrm{mas}_t \subset \{\{u'\}\{u_i\}\} \times \mathrm{LEX}$
such that
$(\{u'\}\{u_i\}, (w, B)) \in \mathrm{mas}_t \Leftrightarrow w = \{u'\}\{u_i\}$
One might call this procedure a "two-word-analysis". We can
also imagine a "three-word-analysis" and so on, but up to now in
our practice in Bonn the morphological analysis consists only of a
"word-by-word-analysis" and a "two-word-analysis" in the manner
shown above.
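
The two-word-analysis can be sketched as a loop that pairs every other word of the sentence with the recognized verb and looks the pair up in the lexicon. The sketch assumes that the verb has already been reduced to its stem (the paper does not describe this reduction), that LEX is a set of pairs `(w, B)`, and that the sample item is a placeholder; the function name `two_word_analysis` is hypothetical.

```python
def two_word_analysis(sentence, verb_index, lex):
    """Return the non-empty mas_t for all t = (u', u_i) with u' != u_i by position."""
    verb = sentence[verb_index]
    result = {}
    for j, u in enumerate(sentence):
        if j == verb_index:
            continue                      # u' ranges over all other positions
        t = (u, verb)
        items = {(w, b) for (w, b) in lex if w == t}
        if items:
            result[t] = items             # keep only identified subsequences
    return result


lex_verbs = {(("ZU", "LAUF"), 'description of "zulauf"')}
# Sentence 2) "Lauf mir nur nicht zu.", segmented, with the verb at position 0.
sentence = ["LAUF", "MIR", "NUR", "NICHT", "ZU", "."]
print(list(two_word_analysis(sentence, 0, lex_verbs)))  # [('ZU', 'LAUF')]
```

The distant prefix "ZU" is found although four words separate it from the stem, which a word-by-word-analysis could not achieve.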

