" L e x i f a n i s " 
A Lexical Analyzer of Modern Greek 
Yannis Kotsanis - Yanis Maestros 
Computer Sc. Dpt. - National Tech. University 
Heroon Polytechniou 9 
GR - 157 73 - Athens, Greece 
"l'écriture fait du savoir une fête" R. BARTHES
ABSTRACT
"Lexifanis" is a Software Tool designed
and implemented by the authors to analyze
Modern Greek Language ("Δημοτική"). This
system assigns grammatical classes (parts
of speech) to 95-98% of the words of a
text which is read and normalized by the
computer.
By providing the system with the
appropriate grammatical knowledge (i.e.
dictionaries of non-inflected words,
affixation morphology and limited surface
syntax rules), any "variant" of Modern
Greek Language (dialect or idiom) can be
processed.
In designing the system, special con- 
sideration is given to the Greek Language 
morphological characteristics, primarily 
to the inflection and the accentuation. 
In Linguistics, Lexifanis can assist
the generation of indexes or lemmata;
readability or style analysis can also
be performed using this software as a
basic component. In Word Processing,
this software may serve as a background
to build dictionaries for a spelling
checking and error detection package.
Through this study our research group
has set the basis for designing an
"expert system" which is intended to
"understand" and process Modern Greek
texts. Lexifanis is the first working
tool for Modern Greek Language.
"Λεξιφάνης" : Who Brings the Words
to Light. Name given by Lucian (circa
160 A.D.) to one of his dialogues.
PROLOGUE 
In Linguistics, the systematic identi-
fication of the word classes raises
several questions in regard to the
morphemic analysis. In Computational
Linguistics, several research areas use
fundamental information such as the
"word class" of a given word, isolated
or in its context.
In Computer Science, the automatic
processing of Greek texts is based on
relevant knowledge at the lexical level.
In an effort to present a software
tool intended to identify the grammati-
cal classes of the words, we have de-
signed and implemented Lexifanis. We
have used modern greek texts as a test-
bed of our system, but Lexifanis can
process any "variant" of modern greek,
and even ancient greek language, provided
that it is appropriately initialized.
In this paper, whenever we use the
term greek or greek language we refer to
the modern greek language ("Δημοτική")
in its recent monotonic version (i.e. a
single accent is used, instead of three,
and there are no breathings, "πνεύματα").
WORD CLASSES 
We have found that morphological analy-
sis of the greek words can provide ade-
quate information for the word class
assignment. The majority of the words
in a text can be assigned a unique
(single) class. However, there exist
some words that may be assigned two
"possible" classes. This ambiguity is
inherent to their morphology. On the
other hand, we know that consideration of
the words in their context may disambi-
guate this classification, if required.
In this work there is no need to use any
stem dictionary.
The fundamental information used by
Lexifanis to provide the classes of all
greek words is extracted from the affixa-
tion morphology and especially from a
morphemic suffix analysis. In this do-
main, we follow three axes of investi-
gation : the "Accentual Scheme", the
"Ending" and the "Pre-ending" of each
word.
Accentual scheme
The "accentual scheme" of the word
reflects the position of the stress on
the word. The stress may fall only on one
of the last three syllables (law of the
three syllables). This scheme is iden-
tified in our system by a code number.
Table 1 lists all possible schemes and
their corresponding identification codes
(IC).
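TABLE 1 : "accentual scheme" of
the greek words

IC  accentual scheme                     example
 0  elided form ending in an apostrophe  θ' : will
 1  one syllable, no accent              θα, πως : will, that
 2  one syllable, accented               πώς(;) : what(?)
 3  two syllables, accent on the last    παιδί : child
 4  two syllables, accent on the first   χάρη : grace
 5  three or more syllables, accent on   αρχαϊκός : archaic
    the last
 6  three or more syllables, accent on   συνθέτω : I compose
    the penultimate
 7  three or more syllables, accent on   πρόβλημα : problem
    the antepenultimate

Notation : each scheme is written with a "word start delimiter",
one symbol (e) per "syllable", an "accent" on the stressed
syllable, and an "apostrophe" for elided forms.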
An example to illustrate the above
feature is the following:
δι-και-ο-σύ-νη (: justice) IC=6 NOUN
χα-ρού-με-νη (: joyful) IC=7 ADJ
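As a minimal sketch of how such a code number could be computed, the following Python fragment counts syllable nuclei and locates the accent. It is not the authors' implementation: the names `nuclei` and `accentual_code`, the simplified list of vowel digraphs, and the reading of IC 0 as "the word ends in an apostrophe" are assumptions made for illustration, with the code numbers taken from Table 1.

```python
import unicodedata

VOWELS = set("αεηιουω")
# Simplified assumption: these vowel pairs form a single syllable nucleus.
DIGRAPHS = {"αι", "ει", "οι", "ου", "υι", "αυ", "ευ"}

def nuclei(word):
    """Return one flag per syllable nucleus: True if it carries the accent."""
    flags = []
    prev = None  # (base vowel, accented) when the previous letter was a vowel
    for ch in word.lower():
        parts = unicodedata.normalize("NFD", ch)
        base = parts[0]
        accented = "\u0301" in parts   # combining acute (tonos)
        diaeresis = "\u0308" in parts  # combining diaeresis
        if base not in VOWELS:
            prev = None
            continue
        if prev and not prev[1] and not diaeresis and prev[0] + base in DIGRAPHS:
            flags[-1] = accented       # second letter of a digraph
            prev = None
        else:
            flags.append(accented)
            prev = (base, accented)
    return flags

def accentual_code(word):
    """Map a word to the identification code (IC) of Table 1."""
    if word.endswith(("'", "\u2019")):
        return 0                       # elided form (assumed reading of IC 0)
    flags = nuclei(word)
    count = len(flags)
    if True not in flags:
        if count == 1:
            return 1                   # unaccented monosyllable
        raise ValueError("polysyllabic word without an accent: " + word)
    from_end = flags[::-1].index(True) + 1  # 1 = last syllable
    if count == 1:
        return 2
    if count == 2:
        return 3 if from_end == 1 else 4
    if from_end > 3:
        raise ValueError("accent violates the law of the three syllables")
    return {1: 5, 2: 6, 3: 7}[from_end]
```

Under these assumptions `accentual_code("δικαιοσύνη")` gives 6 and `accentual_code("χαρούμενη")` gives 7, in agreement with the example above. A real syllabifier would need more care with vowel sequences than this digraph list provides.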
Ending 
A detailed suffix analysis of the
highly inflected greek language [KOYP,67]
[MIRA,59] indicates that there exist mor-
phemes at the end of the word which can
be used to identify the grammatical clas-
ses of the words.
The morphological analysis presented
in this paper is based on a right-to-left
scanning of the words. This analysis
identifies word suffixes, henceforth
named endings. These endings may not
necessarily coincide with the inflectio-
nal suffixes described in the greek
grammar [TPIA,41]. Consider for example
the following pair of words, highlighting
the difference in the ending of the two
words. (In this example the ending is
the inflectional suffix, as well.)
εκτέλ - εσ - η (: execution) NOUN
εκτέλ - εσ - α (: I have executed) VERB
Notice the identical accentual scheme 
of the above two words. 
Pre-ending
On the other hand, these endings re-
flect the incidental cases of morphemic
ambiguity [KOKT,85] in the inflectional
greek language. This ambiguity can be
resolved if we further penetrate into the
word to identify what we call pre-ending.
This pre-ending, in most cases, can be
easily used to disambiguate word
classes and it yields a unique class
assignment when the ending alone is not
sufficient. Generally, the pre-ending
does not coincide with the derivational
suffix of the word under consideration
[TPIA,41].
Let us now consider the following
example :
κάν - ατε (: you have done)
θάνατ - ε (: death, in vocative case)
where the consideration of the linguistic
inflectional suffixes -ατε and -ε is com-
pletely misleading, as far as the class
assignment is concerned. You may notice
that these two words have the same pre-
ending -ατ-. In this case a further
morphemic penetration in the word is
required to resolve the ambiguity
[KRAU,81]:
κάν - ατ - ε VERB
θάν - ατ - ε NOUN
The morphemes identified at this last pe- 
netration may not necessarily form the 
stem of these words. Our system clas- 
sifies the first word as a verb and the 
second as a noun. 
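The two-step use of ending and pre-ending can be sketched as follows. This is an illustration under stated assumptions, not the rule set of Lexifanis: the tables `ENDINGS` and `PRE_ENDINGS` and their entries are hypothetical, chosen only so that the execution example of the Ending section comes out as described.

```python
# Hypothetical rule tables, for illustration only.
# An ending alone may leave two candidate classes.
ENDINGS = {
    "η": {"NOUN", "ADJ"},
    "α": {"NOUN", "VERB"},
}
# A (pre-ending, ending) pair narrows the choice to one class.
PRE_ENDINGS = {
    ("εσ", "η"): "NOUN",
    ("εσ", "α"): "VERB",
}

def classify(word):
    """Scan right to left: match the ending, then try the pre-ending."""
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if len(word) > len(ending) and word.endswith(ending):
            rest = word[: -len(ending)]
            for (pre, end), word_class in PRE_ENDINGS.items():
                if end == ending and rest.endswith(pre):
                    return {word_class}      # unique class assignment
            return set(ENDINGS[ending])      # ending alone: may be ambiguous
    return set()                             # no rule applies: unclassified
```

With these hypothetical tables, `classify("εκτέλεση")` returns `{"NOUN"}`, while a word such as "χάρη", for which no pre-ending rule is listed, keeps both candidate classes.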
Words in their Context 
Finally, if more ambiguities exist in
word class assignment, a consideration of
the "words in their context" may be added
to the affixation morphology. This clas-
sification technique is fruitful in
poorly inflected languages, such as
English [CHER,80], [KRAU,81], [ROBI,82].
This syntax analysis is recommended
when the task is to determine the classes
of the words in a whole text, as op-
posed to the class assignment to isola-
ted words. By this analysis we gain in-
formation from up to two words that pre-
cede or follow the word under classifica-
tion [TZAP,53]. The following is a clas-
sic disambiguation example :
οι αντιθέσ - εις (: the contrasts) NOUN
να αντιθέσ - εις (: to contrast) VERB
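One way to picture this use of context is as rewrite rules over short sequences of class labels. The sketch below applies the two surface syntax rules quoted in the Limited Syntax Analysis section; the function name `apply_rules`, the tuple representation, and the left-to-right, first-match strategy are assumptions and not a description of how Lexifanis applies its rules.

```python
# The two rules quoted in the paper, as (pattern, replacement) pairs.
RULES = [
    (("prep_pron", "verb"), ("pron", "verb")),
    (("prep_pron", "art_pron", "unclass"), ("prep", "art", "name")),
]

def apply_rules(tags, rules=RULES):
    """Rewrite a sequence of class labels, scanning left to right."""
    tags = list(tags)
    i = 0
    while i < len(tags):
        for pattern, replacement in rules:
            n = len(pattern)
            if tuple(tags[i:i + n]) == pattern:
                tags[i:i + n] = replacement
                i += n - 1   # continue after the rewritten window
                break
        i += 1
    return tags
```

For instance, `apply_rules(["prep_pron", "verb"])` yields `["pron", "verb"]`: the double class of the first word is reduced to a single class by looking at its neighbour.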
IMPLEMENTATION 
Dictionaries of Non-Inflected Words
Greek language is highly inflected. 
However, due to the fact that one out of 
two words of a text is a non-inflected 
word we have constructed the dictionaries 
of non-inflected words containing about
4~ entries. In these dictionaries we 
accommodated all the non inflected words, 
that have no derivational suffix, of mo- 
dern greek, such as particles, pronouns, 
prepositions, conjunctions, homonyms,etc. 
and the inflected articles. 
Each word that enters Lexifanis is
first searched in these dictionaries.
If there exists an identical entry, its
class is assigned to this word. Fig. 1
lists some of the entries of these di-
ctionaries. As an example consider "στο"
(: to the, it). This word can be either
"article with preposition" or "pronoun".
art           : η ο οι των
art_pron      : τη της του ...
art_prep      : στης στου στων
art_prep_pron : στη στο στα ...
prep_pron     : ...
pron          : ...
prep          : ...
conj          : ...
homonym       : ...
particle      : ...
num           : δυο δύο τρεις ...
adv           : ...
Fig. 1 Part of the Dictionaries
of Non-Inflected Words
Morphological Analysis
The Morphological Analysis is perfor-
med using about 250 rules. The user may
add, delete or modify any one of these
rules. These rules contain all the in-
formation relevant to the endings and 
pre-endings. During this phase, the in- 
flected words, mainly verbs and nouns, 
are identified. Efficient search is 
carried out using the accentual code, 
mentioned above. 
EXAMPLE: "Five" Morphological Rules : 
<leZ/eE> <n/nq> : noun 
"-:eE> <~l~ql¢> : verb 
,~¢~16~1,5p~.=:: :- <u.'~/~> : name 
,: dU,~;' > .::1 al,:q / m~ >'- : noun 
<auo~ > <:1 Q;.' ). : noun 
Notation : the rules use the symbols for "word start delimiter",
"syllable" (e), "accent" and "exclusive or".
Limited Syntax Analysis
When we want to analyze and classify 
the words of a text as a whole, Lexifanis 
examines the word under consideration in 
its context. This can be accomplished by 
invoking the nearly 25 Limited Surface 
Syntax Rules. 
This step is recommended in case
a word is assigned two possible classes
(double class assignment, see Table 2)
using only the affixation morphology.
This double class assignment is due to
the ambiguity inherent to the morpho-
logy of the word.
EXAMPLE: "Two" of the limited surface
syntax rules :
<prep_pron> <verb>
=> <pron> <verb>
<prep_pron> <art_pron> <unclass>
=> <prep> <art> <name>
THE SOFTWARE SYSTEM
Lexifanis is a set of structured pro-
grams implemented in two versions :
* The BATCH system assigns classes to
the words of a whole text. This system
performs the limited syntax analysis
mentioned above, in addition to the
morphology.
* The INTERACTIVE system assigns classes
to isolated words. This system performs
only the morphological analysis.
Structure of Lexifanis 
The whole software system is designed
and implemented in MODULES or PHASES, the
structure of which is illustrated in the
Block Diagram of Figure 2. The de-
scription of each module follows.
INITIALIZATION - During this phase two 
processes take place : 
* the creation of the Dictionaries of
Non-Inflected Words, and
* the generation of the appropriate 
Automata required to express the mor- 
phological rules and the surface 
syntax rules 
INPUT AND NORMALIZATION OF THE TEXT- The 
interactive version of the software sys- 
tem performs only the accentual scheme 
process, whereas the batch version per- 
forms this process in parallel to the 
input and normalization processes. Norma- 
lization or Word Recognition is the task 
of identifying what constitutes a word in 
a stream of characters. 
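Word recognition of this kind can be sketched with a regular expression. The definition of a word used here (a run of letters, optionally followed by an apostrophe for elided forms) and the lower-casing step are assumptions made for the example; the paper does not specify the normalization rules of Lexifanis.

```python
import re

# Assumed definition: a word is a run of letters (no digits, no underscore),
# optionally followed by an apostrophe marking an elided form.
WORD = re.compile(r"[^\W\d_]+['\u2019]?")

def normalize(text):
    """Split a stream of characters into lower-cased words."""
    return [match.group(0).lower() for match in WORD.finditer(text)]
```

Punctuation and spacing are dropped, so `normalize("Το παιδί, θ' έρθει.")` keeps only the four words of the sentence.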
SUFFIX ANALYSIS - This is the main 
process of our system which is activated 
for words not contained in dictionaries. 
Finite State Automata [AHO,79] are used
to represent the morphological rules. 
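To make the idea concrete, a set of suffix rules can be compiled into a deterministic automaton that reads the word from right to left. The sketch below is only one possible construction: the names `build_automaton` and `run`, the rule entries, and the choice of keeping the deepest accepting state are assumptions, and the class "name" stands for the noun_adject double class of Table 2.

```python
def build_automaton(rules):
    """Compile {suffix: class} rules into a DFA over reversed words."""
    transitions = [{}]   # state number -> {character: next state}
    accepting = {}       # state number -> word class
    for suffix, word_class in rules.items():
        state = 0
        for ch in reversed(suffix):
            nxt = transitions[state].get(ch)
            if nxt is None:
                transitions.append({})
                nxt = len(transitions) - 1
                transitions[state][ch] = nxt
            state = nxt
        accepting[state] = word_class
    return transitions, accepting

def run(automaton, word):
    """Feed the word right to left; return the class of the longest match."""
    transitions, accepting = automaton
    state, result = 0, None
    for ch in reversed(word):
        state = transitions[state].get(ch)
        if state is None:
            break
        result = accepting.get(state, result)  # keep the deepest match
    return result

# Hypothetical rules, for illustration only.
AUTOMATON = build_automaton({"η": "name", "εση": "noun", "εσα": "verb"})
```

Here `run(AUTOMATON, "εκτέλεση")` returns "noun", because the longer suffix overrides the class suggested by the bare ending, while `run(AUTOMATON, "χάρη")` stops at "name".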
LIMITED SYNTAX ANALYSIS - The relevant 
information is represented by automata. 
Fig. 3 The ... two-dimensional garden
[Block diagram: set up dictionaries of non-inflected words;
generate morphological & limited surface syntax rules;
input and normalize text; identify accentual scheme of words;
search in dictionaries (of non-inflected words);
perform suffix (morphological) analysis;
perform limited surface syntax analysis;
process & output the results]
Fig. 2 Structure of Lexifanis
SEARCH IN DICTIONARIES - All the Non-
Inflected Words with the same accentual
scheme and word length are grouped
together, forming a set of small dictio-
nary-trees "cultivated in a two-dimen-
sional ... garden", thus minimizing the
search time (Fig. 3).
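A minimal sketch of such a "garden" is given below, assuming that each plot is indexed by the pair (accentual code, word length) and holds a small character tree. The class name `Garden`, its methods, and the use of nested dictionaries as trees are illustrative assumptions; the accentual code is passed in by the caller.

```python
from collections import defaultdict

class Garden:
    """Small dictionary-trees indexed by (accentual code, word length)."""

    def __init__(self):
        self.plots = defaultdict(dict)   # (ic, length) -> tree root

    def add(self, word, ic, word_class):
        node = self.plots[(ic, len(word))]
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = word_class          # None marks the end of a word

    def lookup(self, word, ic):
        node = self.plots.get((ic, len(word)))
        for ch in word:
            if node is None:
                return None
            node = node.get(ch)
        return node.get(None) if node else None

garden = Garden()
garden.add("στο", 1, "art_prep_pron")    # class as given in the text
garden.add("η", 1, "art")
```

A lookup touches only the one tree whose code and length match, so `garden.lookup("στο", 1)` finds its entry, while `garden.lookup("στα", 1)` and `garden.lookup("στο", 2)` both return `None`.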
RESULTS - This module is best fitted to 
the batch version of our system, but it 
can be used in the interactive version,
as well. 
TABLE 2 : Results obtained from
a Scientific Text

                              after     after
                              morph.    surface
                              analys.   syntax
                                %         %
single classes
 1. article                    5.16     13.53
 2. article with prepos.       0.00      1.20
 3. pronoun                    5.11      6.42
 4. numeral                    3.91      3.91
 5. preposition                2.96      5.26
 6. conjunction                6.47      8.22
 7. adverb                     6.12      6.12
 8. particle                   0.60      0.70
 9. noun                      12.73     12.98
10. proper noun                0.30      0.30
11. adjective                  7.27      7.27
12. participle                 1.50      1.50
13. verb                      13.18     13.18
                              65.31     80.60
double classes
14. art_pronoun               11.78      2.16
15. art with prep_pron         1.25      0.00
16. preposition_pronoun        2.36      0.05
17. non-inflected homonym      2.71      0.85
18. name : noun_adject        11.33     11.33
19. adject_adverb              2.06      1.80
                              31.48     16.69
unclassified words             3.21      2.71
The Results concerning the classifica-
tion of a greek text are summarized in
Table 2.
* A single class is assigned to 80-90%
of the words of any text, 8-15% are as-
signed two possible classes (double class
assignment), and the remaining 2-5% of the
words are left unclassified.
* The variation of the above percenta-
ges is due to the difference in style of
the texts being processed. A scientific
writing, for example, contains fewer ambi-
guities than a poem.
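The totals of Table 2 can be checked with a few lines of arithmetic. The sketch below assumes the column totals read as 65.31 / 31.48 / 3.21 after morphological analysis and 80.60 / 16.69 / 2.71 after surface syntax; the helper names are hypothetical.

```python
# Column totals of Table 2, in percent of the words of the text.
after_morphology = {"single": 65.31, "double": 31.48, "unclassified": 3.21}
after_syntax = {"single": 80.60, "double": 16.69, "unclassified": 2.71}

def classified_share(column):
    """Share of words that received a single or a double class."""
    return round(column["single"] + column["double"], 2)

def total(column):
    """Sum of all three categories; should come to 100 percent."""
    return round(sum(column.values()), 2)
```

Both columns sum to 100%, and the classified share of this text (single plus double classes) is 96.79% after morphological analysis and 97.29% after surface syntax, within the 95-98% range quoted for the system.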
COMPUTATIONAL DETAILS 
Lexifanis' modules are written in the
"Pascal" programming language. This
software runs under the NOS operating
system on a Cyber 171 mainframe computer.
Top-down design and structured program-
ming guarantee the portability of this
product.
The system uses about 35 Kilowords of
the Cyber computer memory (60 bits/word)
and requires 12 seconds of "compilation
time". The batch version classifies the
words at a rate of 110 word classes per
second.
APPLICATIONS
Lexifanis is a complete software tool 
which assigns classes to isolated words 
entered by the user or, alternatively, to 
all the words of an input text. This sys- 
tem can be useful to a variety of appli- 
cations, some of which are listed below. 
The modularity in its design and imple-
mentation, along with the generality of
the concepts implemented, ensures that
our system can be easily integrated into
various software systems.
The most apparent application of Lexi-
fanis is, in Lexicography, the generation
of "morpheme-based" dictionaries and the
generation of lemmata.
Lexifanis may serve as a background in
a spelling checking and error detection
package, or any "writer's aid" software
system.
Finally, Machine Translation would be
another major area of application where
Lexifanis may be included, as a module or
process, in an "expert system".
EPILOGUE
... we have presented a software tool
which assigns grammatical classes to
95-98% of the words of a given text.
This system performs suffix analysis
to assign classes to all the greek words.
For the first time the accentual scheme
has been proved useful in the classifica-
tion of greek words. Moreover, ambiguities
inherent to the suffix morphology of
greek words can be resolved without any
stem dictionary ...
REFERENCES 
[KOYP,67] : Γ. Κουρμούλης, Αντίστροφον
Λεξικόν της Νέας Ελληνικής, Αθήνα, 1967
[TZAP,53] : Α. Τζάρτζανος, Νεοελληνική
Σύνταξις, 2 Τόμοι, Αθήνα, 1946/1953
[TPIA,41] : Μ. Α. Τριανταφυλλίδης, Νεο-
ελληνική Γραμματική, Αθήνα, 1941/1978
[AHO,79] : A. Aho, Pattern Matching in
Strings, Symposium on Formal Language
Theory, Santa Barbara, Univ. of
California, Dec. 1979
[CHER,80] : L.L. Cherry, PARTS - A System
for Assigning Word Classes to English
Text, Computing Science Technical
Report #81, Bell Laboratories, Murray
Hill NJ 07974, 1980
[KOKT,85] : Eva Koktova, Towards a New
Type of Morphemic Analysis, ACL, 2nd
European Chapter, Geneva, 1985
[KRAU,81] : W. Krause and G. Willée, Lem-
matizing German Newspaper Texts with
the Aid of an Algorithm, Computers
and the Humanities 15, 1981
[MIRA,59] : A. Mirambel, La Langue
Grecque Moderne - Description et
Analyse, Klincksieck, Paris, 1959
[ROBI,82] : J.J. Robinson, DIAGRAM : A
Grammar for Dialogues, Comm. of the
ACM, Vol. 25, No 1, 1982
[SOME,80] : H.L. Somers, Brief Descri-
ption and User Manual, Institut pour
les Etudes Sémantiques et Cognitives,
Working Paper #41, 1980
[TURB,81] : T. N. Turba, Checking for
Spelling and Typographical Errors in
Computer-Based Text, Proceedings of
the ACM SIGPLAN-SIGOA on Text Manipu-
lation, Portland - Oregon, 1981
[WINO,83] : T. Winograd, Language as a
Cognitive Process, Vol. I : Syntax,
Addison-Wesley, 1983
