A Rule-based Hyphenator for Modern Greek
Theodora I. Noussia*
Computer Technology Institute 
The purpose of this paper is to formally examine hyphenation as it pertains to Modern Greek with 
the aim of achieving accurate and thorough machine hyphenation. Grammar rules are interpreted 
and formally expressed in terms of regular expressions of word substrings, and exact hyphenation 
rules are derived. Vowel splitting, which traditionally is indicated in terms of prohibitive rather 
than explicit grammar rules, is examined in detail. Many ambiguities caused by circular defini- 
tions of the prohibitive rule vowel sequences are detected, an overwhelming majority of which are 
resolved within the present framework. 
1. Introduction 
Hyphenator programs in modern typesetting systems are necessary to eliminate excess
space between adjacent words in texts. Word hyphenation could be bypassed by
stretching out this space, but this would affect the appearance of the document. A
hyphenator program takes as input a word and returns the set of points within the 
word where hyphens are permissible. Word hyphenation depends strictly on the target 
natural language, and many of the problems encountered are language specific. 
In general, machine hyphenation can be achieved either by consulting lists of 
hyphenated words or by developing pattern-based hyphenation programs (Liang 1983; 
Knuth 1986). The first approach ensures complete and correct hyphenation, but it has 
the disadvantage of being incapable of hyphenating words not on the list. In particular, 
for highly inflectional languages, such as Greek, these word lists would have to be 
extremely extensive in order to include all possible inflectional and derivational word 
forms. Even if such lists could be generated, it would be impossible to include words 
such as compounds, which can be readily created, or all proper names. In addition, 
the initial step toward the development of lists of hyphenated words is commonly 
rule-based hyphenation. On the other hand, although the second approach does not 
raise such problems, it has the disadvantage of being unable to guarantee complete 
and accurate hyphenation. 
The aim of the present study has been to analytically examine Modern Greek 
hyphenation in order to develop a pattern-based hyphenator. The requirement specifi- 
cations are defined as follows: (i) to strictly prohibit impermissible hyphen generation; 
(ii) to generate a hyphen list that is as complete as possible. 
Existing hyphenator programs meet the first requirement either by decreasing the 
number of proposed hyphens or by establishing stop lists containing the appropriately 
hyphenated exceptional words. Commonly, fulfilling the second requirement depends 
on the development of extensive subword patterns associated with hyphenation rules, 
as in Liang 1983, for example. Establishing lists of exceptions has the same disadvan- 
tages as the approach to hyphenating through consulting lists of hyphenated words, 
* Computer Technology Institute, 3, Kolokotroni str., 26 223 Patras, Greece. E-mail: noussia@cti.gr 
© 1997 Association for Computational Linguistics
Computational Linguistics Volume 23, Number 3 
and thus hyphenator dependencies on lists of exceptions must be restricted as much 
as possible. 
Native Greek speakers are able to hyphenate most Greek words fully and unam- 
biguously. In extreme cases, they will propose two hyphen sets for the same word, one 
being a proper subset of the other, but both being acceptable. However, complete au- 
tomatic hyphenation is a rather complex task. Although consonant splitting is clearly 
determined by the grammar rules of Modern Greek and is thus easily expressed in 
terms of non-exceptional formal patterns associated with specific hyphenation rules, 
vowel splitting is not. The main problem of vowel splitting is that the grammar in- 
dicates the cases where splitting is not allowed, and the splitting of a large number 
of these cases is ambiguous. In addition, Greek vowels are sometimes accented, so 
ambiguity resolution concerns thousands of vowel sequences. 
Existing hyphenator programs for Modern Greek are available as either commer- 
cial or research-based products and usually work on a minimal basis, i.e., finding only 
hyphenation points of consonant sequences. Some research-based versions, includ- 
ing one application of the TEX (Knuth 1986) hyphenator for Greek, achieve improved 
hyphenation but cover a minimal subset of the vowel sequences. 
The present paper will encapsulate the standard grammar hyphenation rules and 
the general principles used in this study. The paper expresses these rules (which 
focus mainly on consonant sequences) formally and points out their limitations in 
terms of formal word expressions that can be completely and correctly hyphenated. 
The paper then turns to the problem of vowel splitting, and, by formally examining 
prohibitive grammar rules, deduces general hyphenation rules. It presents additional 
heuristic rules discovered during an exhaustive search of ambiguous vowel patterns, 
and demonstrates the degree of the resolved ambiguity in terms of the number of 
vowel sequences that have been disambiguated. Implementation issues are discussed, 
as well as the problem of words written in uppercase letters. Finally, the paper outlines 
the potential for generalization to other languages. 
2. Hyphenation Rules 
2.1 Consonant Splitting 
According to KEME (1983) 1, the splitting of a Modern Greek word into syllables is 
governed by the following rules: 
C1. A single consonant between two vowels is hyphenated with the
succeeding vowel.
C2. A sequence of two consonants between two vowels is hyphenated with
the succeeding vowel, if a Greek word exists that begins with such a
consonant sequence. Otherwise the sequence is split into two syllables.
C3. A sequence of three or more consonants between two vowels is
hyphenated with the succeeding vowel, if a Greek word exists that
begins with the sequence of the first two consonants. Otherwise it splits,
the first consonant being hyphenated with the preceding vowel.
The output of a hyphenator program is a set of permissible hyphen points within 
the input word. In order to specify this set, we shall proceed to a formal interpretation 
1 This book is the official grammar book of Modern Greek edited by a group of experts and it is a 
revised edition of Triantafillidis (1941, reprint with corrections 1978). 
Noussia Hyphenator for Modern Greek
of the grammar rules. As can easily be observed, the grammar rules are pattern based. 
Thus, the input word is divided into substrings, and the corresponding rules are 
applied to the substrings. Specifically, the goal is to identify the regular expressions 
of the patterns and the exact hyphen points for each formal pattern. Ambiguity issues 
caused by the interpretation of the grammar rules will be resolved. We will also prove 
that rules C1-C3 are not sufficient to provide complete hyphenation coverage of Greek 
words. This study is based on rules C1-C3 and, in addition, adopts the informal definition
of a syllable as consisting of one or more vowels, or of vowel(s) accompanied
by one or more consonants (Triantafillidis 1978, 38).
Let $V$ be the set of vowel characters, $C$ the set of consonant characters, $v \in V$, and
$c \in C$. Specifically, $V$ = {α, ε, η, ι, ο, υ, ω, ά, έ, ή, ί, ό, ύ, ώ, ϊ, ϋ, ΐ, ΰ} and
$C$ = {β, γ, δ, ζ, θ, κ, λ, μ, ν, ξ, π, ρ, σ, τ, φ, χ, ψ}. Subscripts, e.g., $v_1$, $v_2$, $c_1$, $c_2$, are used in order
to distinguish more than one vowel or consonant of the same pattern. The beginning
or end of the input word is indicated by the symbol "∘" ($\circ$). Optionality is indicated by
placing characters inside square brackets. The operation obtaining one or more strings
is denoted by the symbol "+". The operation obtaining zero or more strings is denoted
by "*", or Kleene star (Lewis and Papadimitriou 1981).
We shall begin with the formal representation of the grammar rule subword patterns.
The substrings of rules C1, C2, and C3 constitute one or more consonants between
two vowels, or the strings of the expression $v_1 c^+ v_2$. Let $c_1$ be the first (obligatory)
consonant in the consonant sequence of that expression. Let $c_2$ and $c_3$ be the second
and the last (optional) consonants of the same sequence. Thus, the expression can be
written as $v_1 c_1 [c_2 c^* c_3] v_2$. Therefore:
Lemma 1 
The substrings of grammar rules C1, C2 and C3 are contained in the set of the expression
$v_1 c_1 [c_2 c^* c_3] v_2$.
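As a hedged illustration of Lemma 1, the pattern of one or more consonants between two vowels can be written as an ordinary regular expression. The sketch below is not the implementation described in this paper; the names `VOWELS`, `CONSONANTS`, `EMBEDDED`, and `embedded_clusters` are assumptions introduced for the example, and only lowercase characters are handled.

```python
import re
import unicodedata

# Lowercase vowel and consonant characters (inventory assumed for this sketch).
VOWELS = unicodedata.normalize("NFC", "αεηιουωάέήίόύώϊϋΐΰ")
CONSONANTS = "βγδζθκλμνξπρσςτφχψ"

# A consonant run with a vowel on each side. Lookarounds are used so that the
# vowel closing one match can also open the next match.
EMBEDDED = re.compile(f"(?<=[{VOWELS}])[{CONSONANTS}]+(?=[{VOWELS}])")

def embedded_clusters(word):
    """Return the consonant runs of `word` that lie between two vowels."""
    word = unicodedata.normalize("NFC", word)
    return EMBEDDED.findall(word)

print(embedded_clusters("άνθρωπος"))  # ['νθρ', 'π']
```

Word-initial and word-final consonant runs are deliberately not matched, since the expression only covers consonants embedded between vowels.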
Grammar rules C1, C2, and C3 determine the hyphenation of word substrings 
comprising embedded consonants between vowels. They do not apply to substrings 
containing initial or final consonants. According to the informal definition of syllable 
given above, a syllable has at least one vowel and thus the consonant prefixes and 
suffixes of a word cannot constitute entire syllables. In other words, the maximal 
consonant prefix of a word is always hyphenated with the following vowel and the 
maximal consonant suffix of a word is always hyphenated with the preceding vowel. 
The permissible hyphen points of words are located between the syllables, thus: 
Lemma 2 
(a) The point following the maximal consonant prefix of a word and (b) the point pre- 
ceding the maximal consonant suffix of a word do not constitute permissible hyphen 
points. 
The substrings of Lemma 2(a) comprise the set of all maximal consonant prefixes
of words. Formally, the set of expression $\circ c_1 [c_2 c^* c_3]$ is the set of all
maximal prefixes of consonants. Respectively, $c_1 [c_2 c^* c_3] \circ$ is the expression for the set
of substrings of Lemma 2(b). Thus:
Lemma 3
The consonant substrings of Lemma 2(a) and 2(b) are contained in the sets of expressions
$\circ c_1 [c_2 c^* c_3]$ and $c_1 [c_2 c^* c_3] \circ$, respectively.
Table 1 
Consonant patterns and hyphenation rules. ($C$ = {β, γ, δ, ζ, θ, κ, λ, μ, ν, ξ, π, ρ, σ, ς, τ, φ, χ, ψ},
$CC$ = {βγ, βδ, βλ, βρ, γδ, γκ, γλ, γν, γρ, δρ, θλ, θν, θρ, κλ, κν, κρ, κτ, μν,
μπ, ντ, πλ, πν, πτ, σβ, σγ, σκ, σμ, σπ, στ, σφ, σχ, τζ, τμ,
τρ, τσ, φθ, φτ, φλ, φρ, χθ, χτ, χλ, χν, χρ})
      Pattern                        Condition                       Hyphenation
C1    $v_1 c_1 [c_2 c^* c_3] v_2$    $c_1[c_2] \in CC \cup C$        $v_1 - c_1 [c_2 c^* c_3] v_2$
C2    $v_1 c_1 [c_2 c^* c_3] v_2$    $c_1[c_2] \notin CC \cup C$     $v_1 c_1 - [c_2 c^* c_3] v_2$
Now, let us formally specify the hyphen points as indicated by the grammar rules. 
According to lemmata 2 and 3: 
Lemma 4 
(a) For all words containing a substring $\circ c_1 [c_2 c^* c_3]$, the point immediately following
$c_3$, and (b) for all words containing a substring $c_1 [c_2 c^* c_3] \circ$, the point immediately preceding
$c_1$, are impermissible hyphen points.
In contrast, C1, C2, and C3 specify permissible hyphen points. However, two dif- 
ferent interpretations can be given, namely that (i) only one hyphen point is specified 
by the rules, i.e., the point preceding or (exclusively) following the first embedded
consonant $c_1$, or that (ii) two additional hyphen points are permissible: those preceding
the first and following the second vowel. Both interpretations specify one common
permissible hyphen point, which is, therefore, non-ambiguous. To define this point
formally, let $CC$ be the set of consonant sequences, two characters in length, that can
begin a Greek word. (The exact definition² of set $CC$ is given in Table 1. This set was
extracted from the extensive listing of the initial word syllables presented in Setatos
(1971).)
Theorem 1 
The strings of the expression $v_1 c_1 [c_2 c^* c_3] v_2$ are hyphenated as $v_1 - c_1 [c_2 c^* c_3] v_2$ if
$c_1[c_2] \in CC \cup C$. Otherwise they are hyphenated as $v_1 c_1 - [c_2 c^* c_3] v_2$.
Proof
Rule C1 indicates that the strings of expression $v_1 c_1 v_2$ are always hyphenated as
$v_1 - c_1 v_2$. These strings are a proper subset of $v_1 c_1 [c_2 c^* c_3] v_2$ and they do not contain
consonants $c_2 c^* c_3$. Thus, $c_1[c_2]$ degenerates to $c_1$, and $c_1 \in C$ by definition; hence
these strings are always hyphenated as $v_1 - c_1 [c_2 c^* c_3] v_2$.
The remaining strings are $v_1 c_1 c_2 [c^* c_3] v_2$, and, as indicated by C2 and C3, their
hyphen point is the point preceding $c_1$ if $c_1 c_2 \in CC$, or the point between $c_1$ and $c_2$
otherwise. □
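The two cases of Theorem 1 can be sketched in a few lines of code. This is only an illustration under stated assumptions: `CC_SAMPLE` is a hypothetical, non-exhaustive subset of the word-initial consonant pairs, and `split_vcv` is a name chosen for this example rather than part of the hyphenator described here.

```python
# Illustrative subset of the word-initial pairs CC (assumed, not exhaustive).
CC_SAMPLE = {"βλ", "βρ", "γλ", "γρ", "δρ", "θρ", "κλ", "κρ", "κτ", "μπ",
             "ντ", "πλ", "σκ", "σπ", "στ", "τρ", "φλ", "φρ", "χρ", "χτ"}

def split_vcv(v1, cluster, v2, cc=CC_SAMPLE):
    """Hyphenate v1 + cluster + v2 following the two cases of Theorem 1."""
    if not cluster:
        raise ValueError("cluster must contain at least one consonant")
    # Case 1: a single consonant, or a cluster whose first two consonants can
    # begin a word, goes entirely with the succeeding vowel.
    if len(cluster) == 1 or cluster[:2] in cc:
        return f"{v1}-{cluster}{v2}"
    # Case 2: otherwise the first consonant stays with the preceding vowel.
    return f"{v1}{cluster[0]}-{cluster[1:]}{v2}"

print(split_vcv("α", "στρ", "ο"))  # α-στρο
print(split_vcv("έ", "ρχ", "ο"))   # έρ-χο
```

Only the first two consonants of the cluster are inspected, which mirrors the condition on $c_1[c_2]$; the length of the remaining cluster plays no role.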
2 Some books state that three consonant sequences, namely μπ, ντ, and γκ (/b/, /d/, /g/), are excluded
from set $CC$ under specific contexts; see, for example, paragraph 81, note 4 of Triantafillidis (1941) and
paragraphs 140 and 141 of Tsopanakis (1994). The official grammar book (KEME 1983), however, does
not treat these sequences as exceptional.
Theorem 2 
The points immediately preceding $v_1$ and immediately following $v_2$ in the strings of
expression $v_1 c_1 [c_2 c^* c_3] v_2$ do not necessarily constitute permissible hyphen points.
Proof 
Suppose that grammar rules indicate that the points immediately preceding $v_1$ and
immediately following $v_2$ are also permissible hyphen points. By further taking into
account Theorem 1, the assumption is that $v_1 c_1 [c_2 c^* c_3] v_2$ is hyphenated either as
$- v_1 - c_1 [c_2 c^* c_3] v_2 -$, or exclusively as $- v_1 c_1 - [c_2 c^* c_3] v_2 -$.
A hyphen is not permitted at the beginning or the end of the word, thus the
possibility that the substring is located at the beginning or the end of the word is by
definition excluded. Consequently, there is at least one character preceding $v_1$ and one
following $v_2$. Consider the case where a consonant or consonant sequence precedes $v_1$.
If it is at the beginning of the word, according to Lemma 4(a) the hyphen cannot be
inserted after the consonant(s) and hence the assumption of a hyphen before $v_1$ is false.
Respectively, for the case of a final consonant or consonant sequence after $v_2$, according
to Lemma 4(b) a hyphen following $v_2$ is not permitted. Now consider the case of a
non-initial consonant or consonant sequence preceding $v_1$. In this case, Theorem 1 specifies
one non-ambiguous hyphen point, which will not always be the point preceding $v_1$;
hence the assumption is, again, false. For the case of a nonfinal consonant sequence
following $v_2$, the point implied by Theorem 1 may, in certain contexts, indicate the
same point as the assumption, but in other contexts it may not. Nevertheless, in both
cases the correct point will always be specified, thus the assumed rule does not need to
be reapplied in order to indicate a potentially impermissible hyphen. Cases of a vowel
preceding $v_1$ or following $v_2$ remain to be examined. In these cases, Theorem 1 does
not define additional hyphen points, and Lemma 4 does not indicate impermissible
hyphen points. The issue to examine is vowel splitting independent of consonants.
As we shall see in the following section, vowel splitting is not always permissible.
To complete the proof, it suffices to give two counterexamples
in which the points preceding $v_1$ and following $v_2$ are not permissible hyphen
points, e.g., αυ-λή [av-lí] 'courtyard' and πα-λιός [pa-liós] 'old'.
Therefore, the assumption does not always hold. □
A summarized formal definition of the hyphenation patterns and their associated 
rules as discussed above is presented in Table 1. Theorem 2 gives further support to 
the proposition that grammar rules are not capable of completely hyphenating all NL 
words. 
Theorem 3 
Rules presented in Table 1 are sufficient to completely hyphenate all words containing 
no consecutive vowels. 
Proof 
Every syllable has at least one vowel, thus a word cannot have syllables exceeding 
the number of its vowels, and it cannot have fewer syllables than the number of non- 
ending maximal consonant sequences. Let n be the number of vowels in a word not 
containing consecutive vowels. Then, if the word begins with a consonant or consonant 
sequence the number of non-ending maximal consonant sequences is n, or otherwise, 
n- 1. Consequently, all such words have exactly n syllables. According to the definition 
of a hyphen, these words have exactly n - 1 hyphen points. 
According to Theorems 1 and 2, for each substring $v_1 c_1 [c_2 c^* c_3] v_2$, precisely one
hyphen point can be derived. All words containing $n$ vowels, none of which are
pairwise consecutive, have exactly $n - 1$ substrings of the expression $v_1 c_1 [c_2 c^* c_3] v_2$,
and according to Theorem 1, for each of these, one hyphen point can be derived.
Therefore, for all words containing no consecutive vowels, precisely $n - 1$ hyphens
are derived, and thus the rules of Table 1 are sufficient to completely hyphenate these
words. □
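The counting argument of Theorem 3 can be checked on small inputs with the following sketch, which applies the consonant rules to every embedded consonant run of a word without consecutive vowels. It is a minimal illustration, not the implementation discussed in this paper: `CC_SAMPLE` is an assumed, non-exhaustive subset of the word-initial pairs, the function names are hypothetical, and only lowercase input is handled.

```python
import re
import unicodedata

VOWELS = unicodedata.normalize("NFC", "αεηιουωάέήίόύώϊϋΐΰ")
CONSONANTS = "βγδζθκλμνξπρσςτφχψ"
# Illustrative subset of the word-initial pairs CC (assumed, not exhaustive).
CC_SAMPLE = {"βλ", "βρ", "γλ", "γρ", "δρ", "θρ", "κλ", "κρ", "κτ", "μπ",
             "ντ", "πλ", "σκ", "σπ", "στ", "τρ", "φλ", "φρ", "χρ", "χτ"}

# Consonant runs that have a vowel on both sides.
EMBEDDED = re.compile(f"(?<=[{VOWELS}])[{CONSONANTS}]+(?=[{VOWELS}])")
# Two adjacent vowels, which this sketch refuses to handle.
VOWEL_PAIR = re.compile(f"[{VOWELS}]{{2}}")

def _split(match):
    cluster = match.group(0)
    if len(cluster) == 1 or cluster[:2] in CC_SAMPLE:
        return "-" + cluster                  # v1 - c1[c2 c* c3] v2
    return cluster[0] + "-" + cluster[1:]     # v1 c1 - [c2 c* c3] v2

def hyphenate(word):
    """Hyphenate a lowercase word that contains no consecutive vowels."""
    word = unicodedata.normalize("NFC", word)
    if VOWEL_PAIR.search(word):
        raise ValueError("word contains consecutive vowels")
    return EMBEDDED.sub(_split, word)

def vowel_count(word):
    word = unicodedata.normalize("NFC", word)
    return sum(ch in VOWELS for ch in word)

result = hyphenate("άνθρωπος")
print(result)  # άν-θρω-πος
# n vowels yield exactly n - 1 hyphen points.
print(result.count("-") == vowel_count("άνθρωπος") - 1)  # True
```

Each embedded consonant run contributes exactly one hyphen, so the number of hyphens equals the number of vowels minus one, as the proof states.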
2.1.1 Elimination of consonant sequences and loanword hyphenation. The examination
of consonant splitting has not set any restrictions on the maximum length or even
on the existence of certain consonant sequences. The phthong sequences that Modern
Greek permits are, however, restricted by principles of grammar that are assumed to
be universal. In the case of consonants, these principles state that the maximal consonant
sequence is four characters long, and the maximal consonant prefix and suffix of
Greek words are three characters and one character long, respectively (Setatos 1971).
If these principles had been used in the examination of consonant splitting, the set of
all subword patterns of Table 1 would have been restricted to the set of expression
$v_1 c_1 [c_2 c_3 c_4] v_2$. Similarly, prefix and suffix consonant sequences would be restricted to
$\circ c_1 [c_2 c_3]$ and $c \circ$, respectively. Furthermore, restrictions of specific sequences not possible
in Greek words would further confine these patterns. Loanwords sometimes challenge
these principles. Loanwords have been incorporated into Greek since ancient times and
include words that cannot easily be recognized as borrowed, because of their adaptation
to the above principles. Other loanwords, most frequently words that end
in more than one consonant, e.g., φιλμ [film] 'film', have not completely adapted. A
sequence of more than three consonants at the beginning of the word is also possible,
as in the word Γκντανσκ [gdansk] 'Gdansk' (the city in Poland), although it is quite
infrequent. Cases of more than four consonants may also exist or might appear in
new loanwords or, most likely, occur in artificial words such as tongue twisters. Loanword
hyphenation is governed by the same grammar rules as the rest of the language.
Thus, in order to cover hyphenation of such loanwords, the patterns of Table 1 must
not be restricted in this way. This means that loanword hyphenation is independent
of the rules governing hyphenation in the language from which the word was
borrowed. For example, although no hyphen point is derived by the Greek rules for
the loanwords φιλμ [film] 'film' and τανκ [tank] 'tank', words derived from them,
such as φιλμάκι [filmáki] 'small film' and τάνκερ [tánker] 'tanker', are
hyphenated as φιλ-μά-κι [fil-má-ki] and τάν-κερ [tán-ker].
2.2 Vowel Splitting 
As already discussed, the rules presented in Table 1 cover hyphenation of word sub- 
strings containing at least one consonant; cases of vowel splitting are not covered. 
Vowel splitting is quite common in Modern Greek and is usually handled in grammar 
books with prohibitive rules. These are included within the context of the definition 
of various vowel combinations, but are rarely explicitly included within the set of 
standard hyphenation rules. 3 
Before proceeding to the presentation and analysis of these rules, some terms that 
will be used need to be defined. It should be noted that the terminology used refers to 
3 Vowel sequences are sometimes explicitly mentioned in hyphenation rules, but usually only in the
context of consonant sequences. For example, all references to vowels in rules about splitting of
consonants may be augmented with "(or diphthongs)." This is not sufficient, because vowel sequences that are not next to consonants may split, as in Παπα-ϊ-ω-άν-νου [papa-i-o-án-nou].
the orthographic representation of the various word substrings. Phonetic transcriptions 
are presented for the reader who is not familiar with Greek. Although phonetics is the 
ultimate basis for hyphenation, our approach is based on the available data, which 
is the orthographic representation of words, and not a transcription in a phonetic 
alphabet such as IPA. 
As previously defined, the term vowel refers to a single vowel or
vowel character; $V$ is the set of vowels. Double-vowel blends are phonetically equivalent
to vowels, and their orthographic representation comprises two vowel characters.
Let $2V$ be the set of double-vowel blends, $2V$ = {αι, ει, οι, υι, ου, αί, εί, οί, υί, ού}
([e], [i], [i], [i], [u], [é], [í], [í], [í], [ú]). Some two-vowel orthographic combinations are
phonetically equivalent to a vowel-consonant sound. Let $VC$ be the set of such
two-vowel combinations, $VC$ = {αυ, ευ, ηυ, αύ, εύ, ηύ} ([av], [ev], [iv], [áv], [év], [ív]).
Finally, diphthongs and excessive diphthongs⁴ are vowel sequences consisting of two
parts; each part can comprise either a vowel or a double-vowel blend. Precisely, the
set of diphthongs and excessive diphthongs is a proper subset of $\{f_1 f_2 : f_1, f_2 \in V \cup 2V\}$.
The prohibitive hyphenation rules regarding vowel splitting are as follows: 
V1. Double-vowel blends do not split. 
V2. The combinations αυ, ευ, ηυ, αύ, εύ, and ηύ do not split.⁵
V3. Diphthongs do not split. 
V4. Excessive diphthongs do not split. 
All of the above rules are negative in that they indicate impermissible hyphen points 
within particular substrings of consecutive vowels. As the goal of the hyphenator is to 
identify the permissible hyphen points, we interpret V1, V2, V3, and V4 complemen- 
tarily, i.e., in all other cases, splitting is allowed. It is important to note that the ultimate 
goal is to specify the permissible hyphen points in any vowel sequence, and not only 
in the particular substrings of sequences mentioned in V1, V2, V3, and V4. Formally, 
for every vowel sequence $v_0 \ldots v_{n-1}$ of $n$ vowels, and its corresponding set of points
$P_{v_0 \ldots v_{n-1}} = \{h_i : h_i \text{ is the point between } v_i \text{ and } v_{i+1},\ 0 \le i < n-1\}$, the issue is to identify
the set $IP_{v_0 \ldots v_{n-1}} \subseteq P_{v_0 \ldots v_{n-1}}$ of the impermissible hyphen points. Then the set $PP_{v_0 \ldots v_{n-1}}$ of
the permissible points will be their set difference, or $PP_{v_0 \ldots v_{n-1}} = P_{v_0 \ldots v_{n-1}} - IP_{v_0 \ldots v_{n-1}}$.
Let us first formally specify the impermissible hyphen points in the particular 
sequences of V1-V4 rules. The combinations contained in V1, V2, V3, and V4 are 
distinguished in terms of their constituent elements. All combinations are made up of 
two parts; both parts of double-vowel blends and combinations αυ, ευ, etc. of rule V2
are vowels, while both parts of diphthongs and excessive diphthongs can be either 
vowels or double-vowel blends. Therefore, the impermissible hyphen point is located 
between the two parts in all combinations. For double-vowel blends and the elements 
of VC, which are digrams by definition, the impermissible hyphen point falls between 
its two constituent vowels, or 
$IP_{v_0 v_1 \in 2V \cup VC} = \{h_0\} = P_{v_0 v_1 \in 2V \cup VC}$, hence $PP_{v_0 v_1 \in 2V \cup VC} = \emptyset$. (1)
Therefore, no additional hyphen point is derived for any word where each vowel 
sequence is of either the 2V or the VC type. Consequently, Theorem 3 is augmented 
4 Diphthongs and excessive diphthongs will be defined operationally in the next pages.
5 The ηυ combination is infrequently referred to in grammar books (KEME 1983), possibly because it
appears in only a small number of words. However, this combination is also considered, because such
words are regularly used, e.g., εφηύρα [efívra] 'I invented'.
to apply to words whose vowel substrings are single vowels or two-vowel
elements of 2V or VC.
Lemma 5 
The rules presented in Table 1 are sufficient to completely hyphenate all words in
which each vowel sequence is included in set $v_1[v_2]$, such that $v_1[v_2] \in V \cup 2V \cup VC$.
For words containing at least one n-gram vowel sequence, with $n > 2$, it is not
apparent which vowel pairs, if any, will constitute a double-vowel blend or a VC 
so that the associated negative rules can be applied. Furthermore, diphthongs and 
excessive diphthongs comprised of either digrams consisting of two vowels, or tri- 
grams consisting of a vowel and a double-vowel blend, or tetragrams consisting of 
two double-vowel blends, need to be precisely separated before the rules are applied. 
This procedure is called tokenization (see, for example, Aho et al. 1986). Tokenization
in this case takes as input a vowel sequence and returns a sequential list of maximal 
non-overlapping tokens of the types 2V, VC and V. Tokens do not overlap in that every 
vowel of the sequence is assigned to one and only one token. Tokenization might 
be ambiguous in that it might generate alternative token lists for specific vowel se- 
quences. More precisely, alternative token lists can be generated for sequences where 
a vowel can be associated to its left or its right neighboring vowel in order to build up 
a 2V or VC token. 6 However, tokenization is achieved unambiguously because vowels 
are examined from left to right and a concrete token of the V type is extracted if it 
does not form a double-vowel blend or an element of the VC set with its subsequent 
vowel. Otherwise, a token of 2V or VC type is extracted. In conclusion, 2V and VC 
are disjoint, thus tokenization results in a unique list of tokens. 
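The left-to-right procedure described above can be sketched as follows. This is a hedged illustration rather than the tokenizer used in this study: the members of `TWO_V` and `VC` are the usual Modern Greek double-vowel blends and vowel-consonant combinations and are listed here as an assumption of the sketch, and `tokenize` is a hypothetical name.

```python
import unicodedata

# Double-vowel blends and vowel-consonant combinations (assumed membership).
TWO_V = {"αι", "ει", "οι", "υι", "ου", "αί", "εί", "οί", "υί", "ού"}
VC = {"αυ", "ευ", "ηυ", "αύ", "εύ", "ηύ"}

def tokenize(vowels):
    """Split a vowel sequence into non-overlapping V, 2V and VC tokens.

    Vowels are scanned from left to right; a vowel is paired with its right
    neighbour whenever the pair is in 2V or VC, otherwise it is a V token.
    """
    vowels = unicodedata.normalize("NFC", vowels)
    tokens = []
    i = 0
    while i < len(vowels):
        pair = vowels[i:i + 2]
        if pair in TWO_V:
            tokens.append((pair, "2V"))
            i += 2
        elif pair in VC:
            tokens.append((pair, "VC"))
            i += 2
        else:
            tokens.append((vowels[i], "V"))
            i += 1
    return tokens

print(tokenize("ουί"))  # [('ου', '2V'), ('ί', 'V')]
```

Because the scan always commits to the leftmost possible pair, a vowel that could join either neighbour is attached to the one on its left, which is what makes the token list unique.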
Consider any vowel sequence $v_0 \ldots v_{n-1}$ of $n$ vowels and its $k$-token sequence $f_0 \ldots f_{k-1}$,
$k \le n$, $f_j \in V \cup 2V \cup VC$, $0 \le j \le k-1$, and let $P_{f_0 \ldots f_{k-1}} = \{h_j \mid h_j \text{ is the point between }
f_j \text{ and } f_{j+1},\ 0 \le j < k-1\}$. Let also $IP_{f_0 \ldots f_{k-1}}$ and $PP_{f_0 \ldots f_{k-1}}$ be the sets of impermissible
and permissible hyphen points of the token sequence, respectively. Obviously,
$PP_{f_0 \ldots f_{k-1}} = PP_{v_0 \ldots v_{n-1}}$ and $P_{v_0 \ldots v_{n-1}} \supseteq P_{f_0 \ldots f_{k-1}}$. According to (1), the elements of their
set difference $P_{v_0 \ldots v_{n-1}} - P_{f_0 \ldots f_{k-1}}$ are all impermissible. Thus, the points that remain
to be examined in regard to their hyphen permissibility are elements of set $P_{f_0 \ldots f_{k-1}}$.
This examination will be directed by V3 and V4 prohibitive rules. To conclude, formal 
definitions of diphthong and excessive diphthong sets would suffice. In this case, the 
specification of permissible hyphens would be based on whether each sequence of 
pairwise consecutive tokens is an element of one of these sets. 
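The relation between the points of the vowel sequence and the points of its token sequence can be made concrete with a small sketch. It assumes that the token list has already been produced; `point_sets` is a hypothetical helper, and points are numbered by the index of the vowel to their left, so point i lies between vowel i and vowel i + 1.

```python
def point_sets(tokens):
    """Return (P_v, P_f, P_v - P_f) for a list of token strings.

    P_v holds every point between adjacent vowels, P_f only the points that
    fall between adjacent tokens. Their difference contains the points inside
    2V or VC tokens, which are never permissible.
    """
    n = sum(len(token) for token in tokens)
    p_v = set(range(n - 1))
    p_f = set()
    offset = 0
    for token in tokens[:-1]:
        offset += len(token)
        p_f.add(offset - 1)   # point just after the last vowel of the token
    return p_v, p_f, p_v - p_f

p_v, p_f, inside_tokens = point_sets(["ου", "ί"])
print(p_v, p_f, inside_tokens)  # {0, 1} {1} {0}
```

Only the points in the second set still need to be checked against the diphthong and excessive diphthong rules.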
Identification of diphthongs and excessive diphthongs is a difficult task because 
of the ambiguity that arises when attempting to make specific designations. There are 
extreme cases where sequences exist whose assignment as diphthongs is context de- 
pendent. Some instances remain ambiguous even within precise contexts. Specifically, 
they may or may not be labeled as diphthongs, depending on the specific dialect or on 
the personal preference of the native speaker. To deal with this problem formally we 
shall determine weaker boundaries of diphthongs and excessive diphthongs. When 
considering hyphenation in regard to diphthongs, the problem is that diphthong def- 
initions are circular, as in Triantafillidis (1978, 33), who states that "two vowels 7 that 
6 As a matter of fact, the only sequences that might be problematic in tokenization are in the set of
expression {α | ε | η | ο}{υ | ύ}{ι | ί}. However, the algorithm ensures that the second vowel, υ or ύ,
will be associated with the first. For example, the substring ουί in the word βεδουίνος [veδuínos]
'Bedouin' is separated as ου and ί and not as ο and υί.
7 Double-vowel blends are included in this excerpt. 
Table 2 
Hyphenation rules of vowel patterns. (V = {a, e, 7, ~, o, v, :v, 6, ~, 4, E, 6, 4, ub, \[', ~), ~, ~}, 
I = {~, t, v,/I, L G "t', ~, ~; ~, et, or, vt, eL oE, vE}, U = {or, o(J}, 2V -- {st, et, or, vt, ov, 
c~E, eE, oE, rE, o~}, VC = {~v, ev, ~v, ~(2, e(;, ~(2}, ID = {'\[', "t', g, ©}, lust ..... d ~-~ {/~, E, Z), "~'r ~, 
eg, Og, Vg, 0(2}, IN1 = {'E; i7, 43}, R = {p}, II = {tt, rE, t~?, t~, vT1, v~ , vei, oral, ot~, teE}, 
Y1 = (~,}, vY = {~, 6~, ~, ,6}) 
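       Pattern      Condition                                                              Substitution
F1     $f_1 f_2$    $f_1, f_2 \in (V \cup 2V) - (I \cup U)$                                $f_1 - f_2$
F2     $f_1 f_2$    $f_1 \in V \cup 2V \cup VC \wedge f_2 \in VC$                          $f_1 - f_2$
F3     $f_1 f_2$    $f_1 \in VC \wedge f_2 \in V \cup 2V$                                  $f_1 - f_2$
F4     $f_1 f_2$    $f_1 \in IU_{stressed} \wedge f_2 \in V \cup 2V$                       $f_1 - f_2$
F5     $f_1 f_2$    $f_1 \in (V \cup 2V) - (I \cup U) \wedge f_2 \in IU_{stressed}$        $f_1 - f_2$
F6     $f_1 f_2$    $f_1 \in V \cup 2V \wedge f_2 \in ID1$                                 $f_1 - f_2$
F7     $f_1 f_2$    $f_1 \in ID \wedge f_2 \in V \cup 2V$                                  $f_1 - f_2$
F8     $f_1 f_2$    $f_1 f_2 \in VY$                                                       $f_1 - f_2$
F9     $f_1 f_2$    $f_1 \in (V \cup 2V) - I \wedge f_2 \in (I \cup U) \cap 2V$            $f_1 - f_2$
F10    $f_1 f_2$    $f_1 \in U \wedge f_2 \in V \cup 2V$                                   $f_1 - f_2$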
F~, f,f~ fd~ ~ U v q~ • YI A f~ ~ V U 2V) f~ -5 
F12    $c_1 c_2 f_1 f_2$    $f_1 \in I \cup U \wedge f_2 \in V \cup 2V \wedge c_1 \in C \wedge c_2 \in R$    $c_1 c_2 f_1 - f_2$
2.2.1 Diphthong Identification. In the previous section, vowel splitting was formally 
examined and concrete hyphenation rules were derived. However, as Lemma 6(c) 
explicitly acknowledges, hyphenation is restricted by diphthongs and excessive diph- 
thongs. In this section, we shall proceed to an empirical examination of diphthongs 
and excessive diphthongs. Taking into account the initial specification that the hyphen- 
ator should never generate non-acceptable hyphens, and in order to pare down the 
enormous sets of candidate diphthongs and excessive diphthongs, we need to isolate 
the subset of sequences for which splitting is always permitted. 
The approach followed was to first select all sequences of the above sets that 
were mentioned in various grammar books as examples of diphthongs and to assign 
them to the category of "neversplitting sequences." Then experimental matches were 
conducted through an electronic dictionary of Modern Greek that encodes 100,000 
lemmata and all their inflectional and derivational forms (Vagelatos et al. 1995), and 
a 13 Mbyte corpus of newspaper articles. By definition, this process could not be au- 
tomatic because hyphens were not included in the lexicon or the corpus, but there 
were far too many matches to be examined manually. Manual examination was re- 
stricted to those matches having limited frequency of occurrence. Nevertheless, during 
this process, a systematic method of identifying additional nonsplitting sequences was 
discovered based on a rule for stressing that states that a stress mark can only be ap- 
plied to the ultimate, penultimate, or antepenultimate position of a word. Words in 
the lexicon were hyphenated based on the assumption that all remaining candidates 
do split. This hyphenation, however, resulted in certain words whose stress appeared 
on a syllable to the left of the antepenultimate position. Apparently then, incorrect 
hyphenation had been applied. All diphthong and excessive diphthong candidates 
included in these words were collected and designated nonsplitting sequences. 
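The stress-based check described above can be sketched as a simple filter over hyphenated forms. The sketch is illustrative only: the function names are hypothetical, and the example word άδειασμα is chosen for this illustration rather than taken from the lexicon or corpus used in the study.

```python
import unicodedata

# Characters that carry a stress mark (with or without diaeresis).
STRESSED = set(unicodedata.normalize("NFC", "άέήίόύώΐΰ"))

def stress_position_from_end(hyphenated):
    """Return 1 for ultimate, 2 for penultimate, ... or None if unstressed."""
    syllables = unicodedata.normalize("NFC", hyphenated).split("-")
    for position, syllable in enumerate(reversed(syllables), start=1):
        if any(ch in STRESSED for ch in syllable):
            return position
    return None

def violates_stress_rule(hyphenated):
    """True if the stress falls to the left of the antepenultimate syllable."""
    position = stress_position_from_end(hyphenated)
    return position is not None and position > 3

# Splitting the candidate sequence pushes the stress to the fourth syllable
# from the end, so the candidate is flagged as nonsplitting.
print(violates_stress_rule("ά-δει-α-σμα"))  # True
print(violates_stress_rule("ά-δεια-σμα"))   # False
```

A flagged word only signals that at least one of its candidate sequences was split wrongly; which candidate is responsible still has to be decided from the word itself.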
For the remaining candidates, identification of particular categories of substrings 
where a general exclusion rule may apply was attempted. Disparate and sometimes 
contradictory views given in various books (Setatos 1971; Triantafillidis 1978; Petrou- 
nias 1984; Mackridge 1987; Tsopanakis 1994) were collected. Their integrity was ex- 
tensively examined through selection of matching words found in the corpus and in 
370 
Noussia Hyphenator for Modern Greek 
the lexicon. This empirical process resulted in formally expressed rules independent 
of any exceptions. The sets of categories found are not necessarily disjoint, whereas 
all overlaps always lead to consistent hyphenation. All categories found are explained 
below, and representative hyphenated examples along with IPA transcriptions and 
translations are given. In order to avoid confusion, hyphenation is applied to those 
vowel sequences corresponding to the category currently being explained, and not to 
the entire word. Formal definitions of all categories are given in Table 2. 
1. Examination of excessive diphthong candidates showed that 50% are
immediately eliminated, i.e., always split. Specifically, rule F4 states that
candidate excessive diphthongs whose first part is stressed always
split, e.g., παιδεί-α [peðí-a] 'education', ιστορί-α [istorí-a] 'history',
κύ-ηση [kí-isi] 'pregnancy', βί-αιος [ví-eos] 'violent', λεί-ος [lí-os]
'smooth', Τροί-α [trí-a] 'Troy'. On the other hand, not all diphthong
candidates whose second part is stressed split, but the candidates in this
set that are not simultaneously excessive always split (rule F5, Table
2).
2. Another category is associated with the existence of the diaeresis mark
on a vowel of either a candidate diphthong or an excessive diphthong.
All candidates whose second vowel has both a diaeresis mark and a
stress mark always split, e.g., Μα-ΐου [ma-íu] 'May', προ-ΰπαρξη
[pro-íparksi] 'preexistence', εξα-ΰλωση [eksa-ílosi] 'immateriality'. In
addition, all candidates having as first or second token a ϋ always split,
e.g., ά-ϋλος [á-ilos] 'immaterial', προ-ϋπόθεση [pro-ipóθesi] 'prerequisite',
λαθροϋ-αλουργία [laθroi-alurγía] 'glass smuggling'. As well, candidates
that have as a first part only a nonstressed ϊ always split. Formally, this
category is defined by rules F6 and F7 (Table 2). Diaeresis marks were
used as a discriminating factor for additional candidates. The
single-stress system imposed on Modern Greek in the last decade states
that "if the absence of the diaeresis mark does not generate ambiguity
the mark should be eliminated" (Mackridge 1987, 93). Theoretically, this
simplification could be applied to a variety of vowel sequences, but
examination shows that acceptable words containing such sequences do
not always exist, and not all sequences split. We focus on four that
always split, namely: ωύ - ζω-ύφιο [zo-ífio] 'vermin', άυ - ά-υλος [á-ilos]
'incorporeal, immaterial', ιυ - αρχι-υπηρέτης [arxi-ipirétis] 'butler', ιύ -
περι-ύβριση [peri-ívrisi] 'insult' (rule F8, Table 2).
3. Another observation is that all diphthong candidates having an ου or ού
as a second part, and whose first part is not in set I, always split, e.g.,
κλαί-ουσα [klé-usa] 'weeping willow', ντα-ούλια [da-úlia] 'drums',
μα-ούνα [ma-úna] 'barge', ωραί-ους [oré-us] 'beautiful'. Furthermore, the
category is expanded to include candidates whose second part is a
double-vowel blend. At this point it should be stressed that there are
specific examples where the candidate vowel sequence could
linguistically be considered a diphthong based on pronunciation
(Triantafillidis 1978, 19). However, during hyphenation they split de
facto, e.g., πά-ει [pá-i] 'goes', α-ειθαλής [a-iθalís] 'evergreen' (rule F9,
Table 2).
4. Rule F10 is associated with those candidate excessive diphthongs that
have an ου or ού as a first part. Note that the stressed /u/ has been
Computational Linguistics Volume 23, Number 3 
already included in F4 because of the stress mark. Detailed examination
of the candidates of this category led to the conclusion that the
candidates always split during hyphenation. Although sometimes they
are pronounced as diphthongs, they are split de facto, e.g., Φεβρου-άριος
[fevru-ários] 'February', βου-ητό [vu-itó] 'clamor', βου-ίζει [vu-ízi] 'it
clamors', Βεδου-ίνος [veðu-ínos] 'Bedouin', Ου-αλία [u-alía] 'Wales',
ακου-ομετρία [aku-ometría] 'acoustic metrics'.
5. An interesting subset of candidates concerns the intersection of candidate
diphthong and excessive diphthong sets. This set is $(I \cup U) \times (I \cup U)$ and
although it comprises a relatively great number of elements, most of
these have low frequency of occurrence in linguistically acceptable
words. It should be noted here that some parts of this set have already
been covered by other rules. For the subset not covered, no general rule
was formulated but particular instances that always split were identified.
These instances are covered by rule F11. We observed that some cases
present ambiguity, while others always split, e.g., δι-ιστάμενος
[ði-istámenos] 'contrary', δι-ίσταμαι [ði-ístame] 'I dissent', δι-ηθημένος
[ði-iθiménos] 'filtered', δι-ηπειρωτικός [ði-ipirotikós] 'intercontinental',
δι-ήθηση [ði-íθisi] 'filtering', δι-ήγημα [ði-íγima] 'short story', μυ-ημένος
[mi-iménos] 'initiated', μυ-ήσεις [mi-ísis] 'initiate', ποι-ητής [pi-itís]
'poet', ποι-ήσεις [pi-ísis] 'you will do', αυτοφυ-είς [aftofi-ís] 'self-grown',
επιπλοποι-είς [epiplopi-ís] 'furniture-makers', αλι-εία [ali-ía] 'fishing',
υι-ός [i-ós] 'son', υι-οθεσία [i-oθesía] 'adoption', άρπυι-α [árpi-a] 'harpy'.
6. There is a different rule for determining the splitting of excessive
diphthongs, referred to by both Triantafillidis (1978, 38) and Tsopanakis
(1994, 108). It concerns the natural semantics of excessive diphthongs:
the avoidance of hiatus in the spoken language. If the flow of speech is
constrained by the existence of additional "difficult" or complex
phthongs, the pronunciation of the excessive diphthong in one syllable
becomes impossible. One such case is that of at least a double-consonant
sequence, whose second consonant is ρ [r], followed by a candidate
excessive diphthong. Such a candidate is not an excessive diphthong and
should always be split (rule F12, Table 2).
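The diaeresis test from item 2 above (a second vowel carrying both a diaeresis mark and a stress mark always splits) can be illustrated with Unicode decomposition. The sketch below is an assumption-laden approximation rather than the formal rule of Table 2: it treats any two adjacent vowels as a candidate, and the function names are introduced here for illustration only.

```python
import unicodedata

DIAERESIS = "\u0308"  # combining diaeresis
ACUTE = "\u0301"      # combining acute, the decomposed form of the stress mark
VOWELS = set("αεηιουω")


def marks(ch):
    """Set of combining marks carried by a single precomposed letter."""
    return set(unicodedata.normalize("NFD", ch)[1:])


def is_vowel(ch):
    """True if the base letter (marks stripped) is a Greek vowel."""
    return unicodedata.normalize("NFD", ch)[0].lower() in VOWELS


def diaeresis_stress_splits(word):
    """Indices i such that a hyphen is permissible before word[i], because
    word[i] is a vowel with both marks and word[i - 1] is also a vowel."""
    word = unicodedata.normalize("NFC", word)
    points = []
    for i in range(1, len(word)):
        both_vowels = is_vowel(word[i - 1]) and is_vowel(word[i])
        if both_vowels and {DIAERESIS, ACUTE} <= marks(word[i]):
            points.append(i)
    return points


for w in ("Μαΐου", "προΰπαρξη", "ιστορία"):
    print(w, diaeresis_stress_splits(w))
```

Working on the decomposed form avoids enumerating every precomposed letter with diaeresis and stress by hand.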
It should be noted that additional rules covering additional vowel sequences under 
specific contexts have been found and examined. For example, candidate diphthongs 
located between the members of compound words prefixed by a preposition do not 
split. The automatic identification of these instances would be based on a morpholog- 
ical analysis of words, a process beyond the scope of the present analysis. 
2.2.2 Elimination of non-existent sequences. Having completed the vowel splitting 
study, the question of whether all sequences presented in the rules of Table 2 exist 
within acceptable Modern Greek words arises. Eliminations of consonant patterns ex- 
ceeding a maximum length have already been discussed. Eliminations based on the 
existence of certain vowel sequences may be possible. However, ancient Greek words 
and borrowed foreign words that are frequently used in both written and spoken forms 
contain additional sequences and, as has already been mentioned, their hyphenation is 
governed by the same rules. Nevertheless, vowel sequences that contain consecutive 
stressed vowels or double-vowel blends, or consecutive vowels with diaeresis marks 
do not exist in any word--pure Greek or loan--and thus this can be used as a general 
elimination principle. The patterns $f_1 f_2$, $f_1, f_2 \in V \cup 2V \cup VC$, of Lemma 6 contain exactly
301 such sequences. From the remaining vowel sequences of Lemma 6, a few may be 
identified as non-existent. However, ad hoc compounds that can be readily created 
may contain even those sequences. Mackridge (1987) notes that, unlike with English, 
a person fluent in Greek has no difficulty in pronouncing an unknown word. This 
holds for all vowel sequences in Greek independently of whether they exist within 
acceptable words. It was thus decided to examine all theoretically possible cases and 
not to eliminate a priori any sequences. 
2.3 Degree of Hyphenation Completeness 
The rules in Tables 1 and 2 guarantee 100% correct hyphenation. The rules in Table 1 
are capable of locating all permissible hyphenations of consonant sequences. In regard 
to vowel sequences, set $V \cup 2V \cup VC$ has 34 elements and, according to Lemma 6,
complete hyphenation of vowel sequences depends on $34^2 = 1{,}156$ vowel sequences.
Grammar rules V1 and V2 explicitly define 16 of these, namely the elements of sets 
2V and VC, while grammar books refer to 8 diphthongs that never split. Hence, only 
16 + 8 = 24 sequences were initially non-ambiguous, while 1,156 - 24 = 1,132 were 
ambiguous. Rules F1-F11 (Table 2) resolve the ambiguity of 1,015 different patterns. 
(Occurrences of overlapping patterns have been eliminated by analytically calculat- 
ing the intersection of the sets of patterns for all pairs of rules F1-F11). In general, 
1,156 - 24 - 1,015 = 117 remain ambiguous. Thus, these rules are capable of
completely hyphenating at least $((1{,}015 + 24)/1{,}156) \cdot 100 = 89.9\%$ of the 1,156 sequences.
(If non-existent patterns were eliminated, i.e., those consisting of either two
consecutive stressed vowels or of two consecutive vowels with diaeresis marks, the degree of
completeness of the hyphenator on a vowel pattern basis could then be computed as
$((1{,}029 - 301)/(1{,}156 - 301)) \cdot 100 = 85.2\%$.) Taking into account rule F12, which resolves
ambiguity by proposing additional hyphen points under specific contexts, the degree 
of completeness increases. Furthermore, the ambiguity of additional sequences can be 
resolved without proposing additional hyphens, by using the rule stating that stress 
cannot be applied to a syllable beyond the antepenultimate position. 
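The counts in the paragraph above can be retraced with a short calculation. This is a sketch that only reproduces the arithmetic stated in the text for rules F1-F11; the function and parameter names are assumptions introduced here.

```python
def completeness_figures(n_tokens=34, explicit=16, never_split=8, resolved=1015):
    """Recompute the vowel-sequence counts quoted in the text.

    n_tokens    -- size of the token set V, 2V and VC combined
    explicit    -- sequences defined explicitly by grammar rules V1 and V2
    never_split -- diphthongs that grammar books list as never splitting
    resolved    -- patterns whose ambiguity is resolved by rules F1-F11
    """
    total = n_tokens ** 2                      # all ordered pairs of tokens
    unambiguous = explicit + never_split       # known before the study
    initially_ambiguous = total - unambiguous
    still_ambiguous = initially_ambiguous - resolved
    completeness = 100 * (resolved + unambiguous) / total
    return total, initially_ambiguous, still_ambiguous, completeness


total, initially, remaining, pct = completeness_figures()
print(total, initially, remaining, round(pct, 1))
```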
The degree of completeness calculated above does not represent completeness in
terms of hyphenated words in real text corpora. The proportion of completely
hyphenated words in newspaper texts was manually calculated to be over 99%, as
expected, because the remaining ambiguous vowel sequences occur relatively
rarely in words of real texts.
3. Implementation 
In the previous sections, hyphenation issues were examined as they pertain to Mod- 
ern Greek with the goal of achieving machine hyphenation that is both accurate and 
complete to the highest degree possible. 
Existing hyphenators for Greek are commercial products and usually work on a 
minimal basis, i.e., finding the hyphen points of consonant sequences and, in
limited cases, hyphens of vowel sequences. A research-based version of the Greek TeX
typesetting system (Knuth 1986) provides improved hyphenation, but it only indi- 
cates splitting for 7.1% of the vowel sequences, which seem to have been selected 
rather intuitively. Furthermore, three of the sequences, as was observed, can generate 
impermissible hyphens. 
The rules presented here have been used for the development of a hyphenator 
program included in Microsoft Word for Windows 6.0 and 7.0 (Greek version),
already on the market. The system has also been ported to different platforms including
Lotus AmiPro and a specialized typesetting system of a major Greek newspaper. The 
formal rules and the exact definitions of the sets of vowel and consonant sequences 
compiled in Tables 1 and 2 are sufficient to implement the hyphenator program. Pat- 
terns in Table 2 constitute maximal vowel tokens, which can be derived by a lexical 
analysis process, while patterns in Table 1 consist of single vowels and consonants. 
The hyphenator program comprises two parts: the lexical analyzer and the actual 
hyphenator. The lexical analyzer reads the input characters and produces as output a 
sequence of maximal V, 2V and VC tokens, as well as tokens of the maximal consonant 
sequences of the word. For all tokens, the absolute starting position of the token in 
the input word is maintained, while the length of each token is implicitly defined by 
the token itself. All consonant tokens are also subdivided according to whether their 
two-character prefix is contained in the CC set or not. Nontrivial consonant sequences
are also designated by a flag indicating the occurrence of a ρ [r] suffix.
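A heavily simplified sketch of such a lexical analyzer is given below. It is not the implementation described here: it only groups the input into maximal vowel runs and maximal consonant runs, records each run's absolute starting position, and sets the ρ-suffix flag. It does not subdivide vowel runs into V, 2V and VC tokens or test the CC prefix, since those sets are defined in the tables. The names `Token` and `tokenize`, and the reading of "nontrivial" as "longer than one consonant", are assumptions.

```python
import unicodedata
from dataclasses import dataclass

VOWELS = set("αεηιουω")


@dataclass
class Token:
    kind: str               # "V" for a vowel run, "C" for a consonant run
    start: int              # absolute starting position in the input word
    text: str               # the characters of the run; its length is implicit
    r_suffix: bool = False  # consonant run longer than one letter ending in ρ


def base_letter(ch):
    """Lowercase base letter with stress and diaeresis marks stripped."""
    return unicodedata.normalize("NFD", ch)[0].lower()


def tokenize(word):
    """Split a word into maximal vowel and consonant runs."""
    word = unicodedata.normalize("NFC", word)
    tokens = []
    for i, ch in enumerate(word):
        kind = "V" if base_letter(ch) in VOWELS else "C"
        if tokens and tokens[-1].kind == kind:
            tokens[-1].text += ch          # extend the current run
        else:
            tokens.append(Token(kind, i, ch))
    for tok in tokens:
        if tok.kind == "C" and len(tok.text) > 1 and tok.text.endswith("ρ"):
            tok.r_suffix = True
    return tokens


for tok in tokenize("τρία"):
    print(tok)
```

Keeping the starting position on every token lets later stages report hyphen points as offsets into the original word without re-scanning it.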
Vowel tokens are further classified according to the nearby resident vowel and 
consonant tokens. No additional classification of vowel tokens is needed in the
following cases: (i) vowel tokens not in the $I \cup U$ set; (ii) vowel tokens that appear between
any consonant sequences; (iii) stressed vowel tokens in the $I \cup U$ set that have as a left
neighbor a consonant sequence with an /r/ suffix; (iv) vowel tokens that
simultaneously have stress and diaeresis marks. The remaining vowel tokens are characterized
explicitly as stressed I, nonstressed I, and U.
The actual hyphenation phase follows, where the hyphenator traverses the token 
sequence, identifies all ordered sequences of type (a) |vowel token| - |consonant token|
- |vowel token|, and (b) |vowel token| - |vowel token|, and applies the corresponding
hyphenation rules. The resulting hyphen points are given in terms of the absolute 
starting position in the word of the first or the second token of the sequence currently 
being examined. 
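The traversal step can be sketched as a single pass over the token sequence. The example below only locates the two pattern types and reports the starting positions of their first and last tokens; it does not apply the rules of Tables 1 and 2. The tuple representation of tokens and the name `find_patterns` are assumptions, and the token list for ιστορία is written out by hand.

```python
def find_patterns(tokens):
    """tokens: list of (kind, start) pairs, kind being "V" or "C".

    Returns (pattern, start_of_first_token, start_of_last_token) triples
    for every vowel-consonant-vowel and vowel-vowel sequence.
    """
    patterns = []
    for i in range(len(tokens) - 1):
        kind, start = tokens[i]
        next_kind, next_start = tokens[i + 1]
        if kind == "V" and next_kind == "V":
            patterns.append(("V-V", start, next_start))
        elif kind == "V" and next_kind == "C" and i + 2 < len(tokens):
            third_kind, third_start = tokens[i + 2]
            if third_kind == "V":
                patterns.append(("V-C-V", start, third_start))
    return patterns


# Hand-built tokens for ιστορία: ι | στ | ο | ρ | ί | α
tokens = [("V", 0), ("C", 1), ("V", 3), ("C", 4), ("V", 5), ("V", 6)]
print(find_patterns(tokens))
```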
3.1 Hyphenation of Words in Uppercase 
There is no one-to-one correspondence between uppercase and lowercase letters. The 
main difference is that stress markings are not applied to words whose letters are all 
written in capitals while the diaeresis mark is maintained in capital letters.11
Consequently, the transformation of any uppercase word to lowercase and back to uppercase
again loses no information. The opposite transformation is not always without loss 
of information. To decrease the complexity of the hyphenator, we used only lower- 
case patterns. Thus, uppercase words are transformed to lowercase, hyphenated, and 
transformed back to uppercase forms. 12 
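The two case transformations can be illustrated as follows. This is a sketch under stated assumptions rather than the transformation rules of the actual program: the function names are hypothetical, stress removal is done through Unicode decomposition, and the final-sigma adjustment handles only a sigma at the very end of the word.

```python
import unicodedata

ACUTE = "\u0301"  # decomposed form of the Greek stress mark


def to_upper_greek(word):
    """Uppercase a word, dropping stress marks but keeping diaeresis marks."""
    decomposed = unicodedata.normalize("NFD", word)
    without_stress = decomposed.replace(ACUTE, "")
    return unicodedata.normalize("NFC", without_stress.upper())


def to_lower_greek(word):
    """Lowercase a word, writing a word-final sigma as ς rather than σ."""
    low = word.lower()
    if low.endswith("σ"):
        low = low[:-1] + "ς"
    return low


upper = to_upper_greek("Μαΐου")
print(upper, to_lower_greek(upper))           # stress is lost, diaeresis is kept
print(to_upper_greek(to_lower_greek(upper)))  # uppercase round trip is lossless
```

Because each precomposed letter maps to exactly one letter in the other case, hyphen positions found on the lowercase form can be reused unchanged on the uppercase word.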
Hyphenation patterns of consonant sequences (Table 1) are unchanged because 
consonants do not take stress marks and, moreover, the vowels contained in these 
patterns are independent of stress. On the other hand, many of the patterns derived 
for the hyphenation of vowel sequences cannot be applied to uppercase words
because the most important discriminating factor in diphthong identification is stress
marking (Section 2.2.1), and uppercase letters lack stress markings. This observation
implies that words in uppercase tend to have fewer hyphens than
their lowercase equivalents. This inconsistency cannot be resolved without additional
information about the position of the stress mark.
11 In words written with both capital and lowercase letters, an initial capital letter may have a stress mark. 
12 The transformation takes into account the existence of a final [s] in the uppercase word and transforms it
to the final ς instead of σ, according to a corresponding transformation rule.
4. Discussion 
Overall, it was feasible to make an analytical examination of the hyphenating sys- 
tem mainly because most of the known hyphenation properties were expressed or 
could be expressed in terms of orthographic representation. In Greek, this representa- 
tion contains much of the pronunciation information, which is the ultimate basis for 
hyphenation in every language. When analytical work reached the point where the 
available data could no longer provide the necessary pronunciation information, it 
was replaced by empirical work. 
A similar process would be difficult to conceive in languages in which the or- 
thography and pronunciation are significantly different. It should perhaps be stated 
that the system itself may not have the capacity to be generalized to other languages. 
It is interesting to note that rules governing the splitting of subword patterns exist 
in languages such as English, but their application is usually determined by ortho- 
graphically inexplicit information, such as the existence of a long, short, or stressed 
vowel in some position of the pattern. Different types of properties typical of such 
languages as English and German are based on morphological considerations that 
were not an issue for our system. For example, in English "common roots" is an issue 
in hyphenation of compounds, whereas in Greek, it is not. Such properties are not 
likely to be similarly expressed in a pattern-based model. The process of developing a 
similarly performing hyphenator for such languages would be different. Identification 
of certain patterns would presumably be based on an empirical rather than an ana- 
lytical process. Automatic extraction of common hyphenating properties from on-line 
hyphenated dictionaries is known (Liang 1983). The resulting patterns tend to be more 
detailed and extended. Lists of exceptions seem to be obligatory in such an approach 
because their lack would lead to the generation of impermissible hyphens. 
5. Conclusions 
Hyphenation issues pertaining to Modern Greek have been analyzed, and correct and 
thorough machine hyphenation has been achieved as a result of the present study. 
The explicit interpretation and formal expression of specific grammar rules led to a 
formal hyphenation model, and further provided a means of expressing the model's 
limitations. These limitations were in turn examined through an empirical process, 
which also resulted in formally expressed rules. 
Acknowledgment 
The author is grateful to M. Stamison- 
Atmatzidi for her long hours of 
proofreading, to the three CL reviewers for 
their valuable suggestions and comments, 
and to the Greek newspaper To Vima for the 
availability of the text corpus. 
References 
Aho, Alfred V., Ravi Sethi, and Jeffrey D. 
Ullman. 1986. Compilers, Principles, 
Techniques, and Tools. Addison-Wesley. 
KEME. 1983. Revision of Modern Greek 
Grammar of Manolis Triantafillidis (in 
Greek). Didactic Books Publishing 
Organization. 
Knuth, Donald E. 1986. The TeXbook.
Addison-Wesley.
Lewis, Harry and Christos Papadimitriou. 
1981. Elements of the Theory of Computation. 
Prentice-Hall Software Series. 
Liang, Frank M. 1983. Word hy-phen-a-tion by
computer. Ph.D. thesis, Stanford
University.
Mackridge, Peter. 1987. The Modern Greek 
Language. Oxford University Press. 
Petrounias, Evangelos. 1984. Modern Greek 
Grammar-Comparative Analysis. Volume A: 
General Linguistic Fundamentals, Phonetic, 
Introduction to Phonology, Part A: Theory (in 
Greek). University Studio Press, 
Thessaloniki. 
Setatos, Michalis. 1971. Phonology of Modern 
Greek Koine (in Greek). Papazisis 
Publishing, Athens. 
Triantafillidis, Manolis. 1941. Modern Greek 
Grammar (Dimotiki) (in Greek). Reprint 
with corrections 1978. Institute of Modern 
Greek Studies, Thessaloniki. 
Tsopanakis, Agapitos. 1994. New Greek 
Grammar (in Greek). Second Edition. 
Athens-Thessaloniki. 
Vagelatos, Aristidis, Theodora 
Triantopoulou, Christos Tsalidis, and 
Dimitris Christodoulakis. 1995. Utilization 
of a Lexicon for Spelling Correction in 
Modern Greek. 10th Annual Symposium on
Applied Computing--Special Track on 
Artificial Intelligence, Nashville, TN, 
February. 
