<?xml version="1.0" standalone="yes"?> <Paper uid="J97-3001"> <Title>A Rule-based Hyphenator for Modern Greek</Title> <Section position="3" start_page="0" end_page="363" type="metho"> <SectionTitle> 2. Hyphenation Rules </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="363" type="sub_section"> <SectionTitle> 2.1 Consonant Splitting </SectionTitle> <Paragraph position="0"> According to KEME (1983), the splitting of a Modern Greek word into syllables is governed by the following rules:</Paragraph> <Paragraph position="1"> C1. A single consonant between two vowels is hyphenated with the succeeding vowel.</Paragraph> <Paragraph position="2"> C2. A sequence of two consonants between two vowels is hyphenated with the succeeding vowel, if a Greek word exists that begins with such a consonant sequence. Otherwise the sequence is split into two syllables.</Paragraph> <Paragraph position="3"> C3. A sequence of three or more consonants between two vowels is hyphenated with the succeeding vowel, if a Greek word exists that begins with the sequence of the first two consonants. Otherwise it splits, the first consonant being hyphenated with the preceding vowel.</Paragraph> <Paragraph position="4"> Noussia Hyphenator for Modern Greek The output of a hyphenator program is a set of permissible hyphen points within the input word. In order to specify this set, we shall proceed to a formal interpretation of the grammar rules. As can easily be observed, the grammar rules are pattern-based. Thus, the input word is divided into substrings, and the corresponding rules are applied to the substrings. Specifically, the goal is to identify the regular expressions of the patterns and the exact hyphen points for each formal pattern. Ambiguity issues caused by the interpretation of the grammar rules will be resolved. We will also prove that rules C1-C3 are not sufficient to provide complete hyphenation coverage of Greek words. 
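Rules C1-C3 as stated above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name split_cluster is hypothetical, the input is assumed to be a non-empty consonant cluster found between two vowels, and CC_SUBSET is a small illustrative subset of word-initial consonant pairs, not the full set the paper lists in its Table 1.

```python
# Illustrative subset only (assumption); the paper's full set CC is in its Table 1.
CC_SUBSET = {"στ", "πρ", "τρ", "κλ"}


def split_cluster(cluster, cc=CC_SUBSET):
    """Split a non-empty consonant cluster found between two vowels.

    Returns (left, right): the consonants kept with the preceding vowel and
    the consonants hyphenated with the succeeding vowel, following C1-C3.
    """
    if len(cluster) == 1:
        # C1: a single consonant goes with the succeeding vowel.
        return "", cluster
    if cluster[:2] in cc:
        # C2/C3: the first two consonants can begin a Greek word,
        # so the whole cluster goes with the succeeding vowel.
        return "", cluster
    # Otherwise the first consonant stays with the preceding vowel.
    return cluster[0], cluster[1:]


print(split_cluster("λ"))
print(split_cluster("στ"))
print(split_cluster("ρτ"))
print(split_cluster("στρ"))
```

Passing the pair set as a parameter keeps the sketch independent of any particular listing of word-initial consonant pairs.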
This study is based on rules C1-C3 and, in addition, adopts the informal definition of a syllable as consisting of one or more vowels, optionally accompanied by one or more consonants (Triantafillidis 1978, 38). Let $V$ be the set of vowel characters, $C$ the set of consonant characters, $v \in V$, and $c \in C$. Specifically, $V$ = {α, ε, η, ι, ο, υ, ω, ά, έ, ή, ί, ό, ύ, ώ, ϊ, ϋ, ΐ, ΰ} and $C$ = {β, γ, δ, ζ, θ, κ, λ, μ, ν, ξ, π, ρ, σ, τ, φ, χ, ψ}. Subscripts, e.g., $v_1$, $v_2$, $c_1$, $c_2$, are used in order to distinguish more than one vowel or consonant of the same pattern. The beginning or end of the input word is indicated by the symbol &quot;$\circ$&quot;. Optionality is indicated by placing characters inside square brackets. The operation obtaining one or more strings is denoted by the symbol &quot;+&quot;. The operation obtaining zero or more strings is denoted by &quot;*&quot;, or Kleene star (Lewis and Papadimitriou 1981).</Paragraph> <Paragraph position="5"> We shall begin with the formal representation of the grammar rule subword patterns. The substrings of rules C1, C2, and C3 constitute one or more consonants between two vowels, or the strings of the expression $v_1c^+v_2$. Let $c_1$ be the first (obligatory) consonant in the consonant sequence of that expression. Let $c_2$ and $c_3$ be the second and the last (optional) consonants of the same sequence. Thus, the expression can be written as $v_1c_1[c_2c^*c_3]v_2$. Therefore: Lemma 1 The substrings of grammar rules C1, C2, and C3 are contained in the set of the expression $v_1c_1[c_2c^*c_3]v_2$.</Paragraph> <Paragraph position="6"> Grammar rules C1, C2, and C3 determine the hyphenation of word substrings comprising embedded consonants between vowels. They do not apply to substrings containing initial or final consonants. According to the informal definition of syllable given above, a syllable has at least one vowel and thus the consonant prefixes and suffixes of a word cannot constitute entire syllables. 
In other words, the maximal consonant prefix of a word is always hyphenated with the following vowel and the maximal consonant suffix of a word is always hyphenated with the preceding vowel. The permissible hyphen points of words are located between the syllables, thus: Lemma 2 (a) The point following the maximal consonant prefix of a word and (b) the point preceding the maximal consonant suffix of a word do not constitute permissible hyphen points.</Paragraph> <Paragraph position="7"> The substrings of Lemma 2(a) comprise the set of all maximal consonant prefixes of words. Formally, the set of expression $\circ c_1[c_2c^*c_3]$ is the set of all maximal prefixes of consonants. Respectively, $c_1[c_2c^*c_3]\circ$ is the expression for the set of substrings of Lemma 2(b). Thus: Lemma 3 The consonant substrings of Lemma 2(a) and 2(b) are contained in the sets of expressions $\circ c_1[c_2c^*c_3]$ and $c_1[c_2c^*c_3]\circ$, respectively.</Paragraph> <Paragraph position="8"> Computational Linguistics Volume 23, Number 3 Table 1 Consonant patterns and hyphenation rules. ($C$ = {β, γ, δ, ζ, θ, κ, λ, μ, ν, ξ, π, ρ, σ, ς, τ, φ, χ, ψ}, $CC$ = {/3% f16, /3,\, tip, 3'6, 7~, 7A, 3'v, &quot;yp, 6p, OA, Or, Op, hA, ~v, ~p, ~r, #v, #Tr, vr, 7rA, Try, ~rr, aft, a7, an, a#, a~r, at, a(t, aX, TG Tit, Tp, Ta, (tO, (tT, (tA, (tip, XS, XT, XA, XP, XP}) $c_3$, and (b) for all words containing a substring $c_1[c_2c^*c_3]\circ$, the point immediately preceding $c_1$, are impermissible hyphen points.</Paragraph> <Paragraph position="9"> In contrast, C1, C2, and C3 specify permissible hyphen points. However, two different interpretations can be given, namely that (i) only one hyphen point is specified by the rules, i.e., the point preceding or (exclusively) following the first embedded consonant $c_1$, or that (ii) two additional hyphen points are permissible: those preceding the first and following the second vowel. Both interpretations specify one common permissible hyphen point, which is, therefore, non-ambiguous. 
To define this point formally, let $CC$ be the set of consonant sequences, two characters in length, that can begin a Greek word. The exact definition of set $CC$ (footnote 2) is given in Table 1; this set was extracted from the extensive listing of initial word syllables presented in Setatos (1971).</Paragraph> <Paragraph position="10"> Theorem 1 The strings of the expression $v_1c_1[c_2c^*c_3]v_2$ are hyphenated as $v_1$ - $c_1[c_2c^*c_3]v_2$ if $c_1[c_2] \in CC \cup C$. Otherwise they are hyphenated as $v_1c_1$ - $[c_2c^*c_3]v_2$.</Paragraph> <Paragraph position="11"> Proof Rule C1 indicates that the strings of expression $v_1c_1v_2$ are always hyphenated as $v_1$ - $c_1v_2$. These strings are a proper subset of $v_1c_1[c_2c^*c_3]v_2$ and they do not contain consonants $c_2c^*c_3$. Thus, $c_1[c_2]$ degenerates to $c_1$, while $c_1 \in C$ by definition, and hence in this case $v_1c_1[c_2c^*c_3]v_2$ is always hyphenated as $v_1$ - $c_1[c_2c^*c_3]v_2$.</Paragraph> <Paragraph position="12"> The remaining strings are $v_1c_1c_2[c^*c_3]v_2$, and, as indicated by C2 and C3, their hyphen point is the point preceding $c_1$ if $c_1c_2 \in CC$, or the point between $c_1$ and $c_2$ otherwise. □</Paragraph> </Section> </Section> <Section position="4" start_page="363" end_page="372" type="metho"> <SectionTitle> 2 Some books state that three consonant sequences, namely μπ, ντ, and γκ (/b/, /d/, /g/) are excluded </SectionTitle> <Paragraph position="0"> from set CC under specific contexts, see for example, paragraph 81, note 4 of Triantafillidis (1941) and paragraphs 140 and 141 of Tsopanakis (1994). The official grammar book (KEME 1983), however, does not treat these sequences as exceptional.</Paragraph> <Paragraph position="1"> Theorem 2 The points immediately preceding $v_1$ and immediately following $v_2$ in the strings of expression $v_1c_1[c_2c^*c_3]v_2$ do not necessarily constitute permissible hyphen points. 
Proof Suppose that grammar rules indicate that the points immediately preceding $v_1$ and immediately following $v_2$ are also permissible hyphen points. By further taking into account Theorem 1, the assumption is that $v_1c_1[c_2c^*c_3]v_2$ is hyphenated either as - $v_1$ - $c_1[c_2c^*c_3]v_2$ -, or exclusively as - $v_1c_1$ - $[c_2c^*c_3]v_2$ -.</Paragraph> <Paragraph position="2"> A hyphen is not permitted at the beginning or the end of the word, thus the possibility that the substring is located at the beginning or the end of the word is by definition excluded. Consequently, there is at least one character preceding $v_1$ and one following $v_2$. Consider the case where a consonant or consonant sequence precedes $v_1$. If it is at the beginning of the word, according to Lemma 4(a) the hyphen cannot be inserted after the consonant(s) and hence the assumption of a hyphen before $v_1$ is false. Respectively, for the case of a final consonant or consonant sequence after $v_2$, according to Lemma 4(b) a hyphen following $v_2$ is not permitted. Now consider the case of a non-initial consonant or consonant sequence preceding $v_1$. In this case, Theorem 1 specifies one non-ambiguous hyphen point, which will not always be the point preceding $v_1$; hence the assumption is, again, false. For the case of a non-final consonant sequence following $v_2$, the point implied by Theorem 1 may, in certain contexts, indicate the same point as the assumption, but in other contexts it may not. Nevertheless, in both cases the correct point will always be specified, thus the assumed rule does not need to be reapplied in order to indicate a potentially impermissible hyphen. Cases of a vowel preceding $v_1$ or following $v_2$ remain to be examined. In these cases, Theorem 1 does not define additional hyphen points, and Lemma 4 does not indicate impermissible hyphen points. 
The issue to examine is vowel splitting independent of consonants.</Paragraph> <Paragraph position="3"> As we shall see in the following section, vowel splitting is not always permissible. To complete the proof, it suffices to give two counterexamples in which the points preceding $v_1$ and following $v_2$ are not permissible hyphen points, e.g., αυ-λή [av-lí] 'courtyard', and πα-λιός [pa-liós] 'old'.</Paragraph> <Paragraph position="4"> Therefore, the assumption does not always hold. □ A summarized formal definition of the hyphenation patterns and their associated rules as discussed above is presented in Table 1. Theorem 2 gives further support to the proposition that grammar rules are not capable of completely hyphenating all NL words.</Paragraph> <Paragraph position="5"> Theorem 3 Rules presented in Table 1 are sufficient to completely hyphenate all words containing no consecutive vowels.</Paragraph> <Paragraph position="6"> Proof Every syllable has at least one vowel, thus a word cannot have syllables exceeding the number of its vowels, and it cannot have fewer syllables than the number of non-ending maximal consonant sequences. Let $n$ be the number of vowels in a word not containing consecutive vowels. Then, if the word begins with a consonant or consonant sequence, the number of non-ending maximal consonant sequences is $n$, or otherwise, $n - 1$. Consequently, all such words have exactly $n$ syllables. According to the definition of a hyphen, these words have exactly $n - 1$ hyphen points.</Paragraph> <Paragraph position="7"> According to Theorems 1 and 2, for each substring $v_1c_1[c_2c^*c_3]v_2$, precisely one hyphen point can be derived. All words containing $n$ vowels, none of which are pairwise consecutive, have exactly $n - 1$ substrings of the expression $v_1c_1[c_2c^*c_3]v_2$, and according to Theorem 1, for each of these, one hyphen point can be derived. 
Therefore, for all words containing no consecutive vowels, precisely $n - 1$ hyphens are derived, and thus the rules of Table 1 are sufficient to completely hyphenate these words. □ 2.2.1 Elimination of consonant sequences and loanword hyphenation. The examination of consonant splitting has not set any restrictions on the maximum length or even on the existence of certain consonant sequences. The phthong sequences that Modern Greek permits are, however, restricted by principles of grammar that are assumed to be universal. In the case of consonants, these principles state that the maximal consonant sequence is four characters long, and the maximal consonant prefix and suffix of Greek words are three characters and one character long, respectively (Setatos 1971). If these principles had been used in the examination of consonant splitting, the set of all subword patterns of Table 1 would have been restricted to the set of expression $v_1c_1[c_2c_3c_4]v_2$. Similarly, prefix and suffix consonant sequences would be restricted to $\circ c_1[c_2c_3]$ and $c\circ$, respectively. Furthermore, restrictions of specific sequences not possible in Greek words would further confine these patterns. Loanwords sometimes challenge these principles. Loanwords have been incorporated into Greek since ancient times and include words that cannot easily be recognized as borrowed, because of their adaptation to the above principles. Other loanwords, most frequently words that end in more than one consonant, e.g., φιλμ [film] 'film', have not completely adapted. A sequence of more than three consonants at the beginning of the word is also possible, as in the word Γκντανσκ [gdansk] 'Gdansk' (the city in Poland), although it is quite infrequent. Cases of more than four consonants may also exist or might appear in new loanwords or, most likely, occur in artificial words such as tongue twisters. Loanword hyphenation is governed by the same grammar rules as the rest of the language. 
Thus, in order to cover hyphenation of such loanwords, the patterns of Table 1 must not be eliminated. Apparently, this means that loanword hyphenation is independent of the rules governing hyphenation in the original language from which the word was borrowed. For example, although no hyphen point is derived by the Greek rules for the loanwords φιλμ [film] 'film' and τανκ [tank] 'tank', derivatives of these words, such as φιλμάκι [filmáki] 'small film' and τάνκερ [tánker] 'tanker', are hyphenated as φιλ-μά-κι [fil-má-ki] and τάν-κερ [tán-ker].</Paragraph> <Section position="1" start_page="365" end_page="372" type="sub_section"> <SectionTitle> 2.2 Vowel Splitting </SectionTitle> <Paragraph position="0"> As already discussed, the rules presented in Table 1 cover hyphenation of word substrings containing at least one consonant; cases of vowel splitting are not covered.</Paragraph> <Paragraph position="1"> Vowel splitting is quite common in Modern Greek and is usually handled in grammar books with prohibitive rules. These are included within the context of the definition of various vowel combinations, but are rarely explicitly included within the set of standard hyphenation rules. (Footnote 3: Vowel sequences are sometimes explicitly mentioned in hyphenation rules, but usually only in the context of consonant sequences. For example, all references to vowels in rules about splitting of consonants may be augmented with &quot;(or diphthongs).&quot; This is not sufficient because vowel sequences that are not next to consonants may split, as in Παπα-ϊ-ω-άν-νου [papa-i-o-án-nou].) Before proceeding to the presentation and analysis of these rules, some terms that will be used need to be defined. It should be noted that the terminology used refers to the orthographic representation of the various word substrings. Phonetic transcriptions are presented for the reader who is not familiar with Greek. 
Although phonetics is the ultimate basis for hyphenation, our approach is based on the available data, which is the orthographic representation of words, and not a transcription in a phonetic alphabet such as IPA.</Paragraph> <Paragraph position="2"> As previously defined, the term vowel refers to a single vowel or vowel character; $V$ is the set of vowels. Double-vowel blends are phonetically equivalent to vowels, and their orthographic representation comprises two vowel characters. Let $2V$ be the set of double-vowel blends, $2V$ = {αι, ει, οι, υι, ου, αί, εί, οί, υί, ού} ({[e], [i], [i], [i], [u], [é], [í], [í], [í], [ú]}). Some two-vowel orthographic combinations are phonetically equivalent to a vowel-consonant sound. Let $VC$ be the set of such two-vowel combinations, $VC$ = {αυ, ευ, ηυ, αύ, εύ, ηύ} ({[av], [ev], [iv], [áv], [év], [ív]}). Finally, diphthongs and excessive diphthongs (footnote 4) are vowel sequences consisting of two parts; each part can comprise either a vowel or a double-vowel blend. Precisely, the set of diphthongs and excessive diphthongs is a proper subset of $\{f_1f_2 : f_1, f_2 \in V \cup 2V\}$.</Paragraph> <Paragraph position="3"> The prohibitive hyphenation rules regarding vowel splitting are as follows: V1. Double-vowel blends do not split.</Paragraph> <Paragraph position="4"> V2. The combinations αυ, ευ, ηυ, αύ, εύ, and ηύ do not split (footnote 5). V3. Diphthongs do not split.</Paragraph> <Paragraph position="5"> V4. Excessive diphthongs do not split.</Paragraph> <Paragraph position="6"> All of the above rules are negative in that they indicate impermissible hyphen points within particular substrings of consecutive vowels. As the goal of the hyphenator is to identify the permissible hyphen points, we interpret V1, V2, V3, and V4 complementarily, i.e., in all other cases, splitting is allowed. 
It is important to note that the ultimate goal is to specify the permissible hyphen points in any vowel sequence, and not only in the particular substrings of sequences mentioned in V1, V2, V3, and V4. Formally, for every vowel sequence $v_0 \ldots v_{n-1}$ of $n$ vowels, and its corresponding set of points $P_{v_0 \ldots v_{n-1}} = \{h_i : h_i$ is the point between $v_i$ and $v_{i+1}$, $0 \le i \le n-2\}$, the issue is to identify the set $IP_{v_0 \ldots v_{n-1}} \subseteq P_{v_0 \ldots v_{n-1}}$ of the impermissible hyphen points. Then the set $PP_{v_0 \ldots v_{n-1}}$ of the permissible points will be their set difference, or $PP_{v_0 \ldots v_{n-1}} = P_{v_0 \ldots v_{n-1}} - IP_{v_0 \ldots v_{n-1}}$. Let us first formally specify the impermissible hyphen points in the particular sequences of the V1-V4 rules. The combinations contained in V1, V2, V3, and V4 are distinguished in terms of their constituent elements. All combinations are made up of two parts; both parts of double-vowel blends and of the combinations αυ, ευ, etc. of rule V2 are vowels, while both parts of diphthongs and excessive diphthongs can be either vowels or double-vowel blends. Therefore, the impermissible hyphen point is located between the two parts in all combinations. For double-vowel blends and the elements of $VC$, which are digrams by definition, the impermissible hyphen point falls between their two constituent vowels, or</Paragraph> <Paragraph position="8"> Therefore, no additional hyphen point is derived for any word where each vowel sequence is of either the $2V$ or the $VC$ type. Consequently, Theorem 3 is augmented to apply to words containing a maximum of two vowel substrings that are elements of $2V$ or $VC$.</Paragraph> <Paragraph position="9"> (Footnote 4: Diphthongs and excessive diphthongs will be defined operationally in the next pages.) (Footnote 5: The ηυ combination is infrequently referred to in grammar books (KEME 1983), possibly because it appears in only a small number of words. However, this combination is also considered, because such words are regularly used, e.g., εφηύρα [efívra] 'I invented'.)</Paragraph> <Paragraph position="10"> Lemma 5 The rules presented in Table 1 are sufficient to completely hyphenate all words in which each vowel sequence is included in the set $v_1[v_2]$, such that $v_1[v_2] \in V \cup 2V \cup VC$. For words containing at least one n-gram vowel sequence, with n > 2, it is not apparent which vowel pairs, if any, will constitute a double-vowel blend or a $VC$ so that the associated negative rules can be applied. Furthermore, diphthongs and excessive diphthongs comprised of either digrams consisting of two vowels, or trigrams consisting of a vowel and a double-vowel blend, or tetragrams consisting of two double-vowel blends, need to be precisely separated before the rules are applied. This procedure is called tokenization (see for example, Aho et al. [1986]). Tokenization in this case takes as input a vowel sequence and returns a sequential list of maximal non-overlapping tokens of the types $2V$, $VC$, and $V$. Tokens do not overlap in that every vowel of the sequence is assigned to one and only one token. Tokenization might be ambiguous in that it might generate alternative token lists for specific vowel sequences. More precisely, alternative token lists can be generated for sequences where a vowel can be associated to its left or its right neighboring vowel in order to build up a $2V$ or $VC$ token (footnote 6). However, tokenization is achieved unambiguously because vowels are examined from left to right and a concrete token of the $V$ type is extracted if it does not form a double-vowel blend or an element of the $VC$ set with its subsequent vowel. Otherwise, a token of $2V$ or $VC$ type is extracted. 
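The left-to-right tokenization just described can be sketched as follows. This is a minimal illustration, not the author's implementation: the function name tokenize is hypothetical, the input is assumed to be a lowercase vowel sequence with precomposed accented characters, and the two sets are written out from the definitions of 2V and VC given in the text.

```python
# Digram sets as defined in the text (assumed lowercase, precomposed accents).
TWO_V = {"αι", "ει", "οι", "υι", "ου", "αί", "εί", "οί", "υί", "ού"}
VC = {"αυ", "ευ", "ηυ", "αύ", "εύ", "ηύ"}


def tokenize(vowels):
    """Split a vowel sequence into V, 2V and VC tokens, scanning left to right."""
    tokens = []
    i = 0
    while len(vowels) - i > 0:
        pair = vowels[i:i + 2]
        if pair in TWO_V or pair in VC:
            # The current vowel forms a blend or a VC element with its successor.
            tokens.append(pair)
            i += 2
        else:
            # Otherwise the current vowel is a token of type V on its own.
            tokens.append(vowels[i])
            i += 1
    return tokens


print(tokenize("ουί"))
print(tokenize("αϊωά"))
```

Because the scan always pairs a vowel with its right neighbor first, a sequence such as ουί yields the tokens ου and ί rather than ο and υί.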
In conclusion, $2V$ and $VC$ are disjoint, thus tokenization results in a unique list of tokens.</Paragraph> <Paragraph position="11"> Consider any vowel sequence $v_0 \ldots v_{n-1}$ of $n$ vowels and its $k$-token sequence $f_0 \ldots f_{k-1}$, $k \le n$, $f_j \in V \cup 2V \cup VC$, $0 \le j \le k-1$, and let $P_{f_0 \ldots f_{k-1}} = \{h_j \mid h_j$ is the point between $f_j$ and $f_{j+1}$, $0 \le j \le k-2\}$. Let also $IP_{f_0 \ldots f_{k-1}}$ and $PP_{f_0 \ldots f_{k-1}}$ be the sets of impermissible and permissible hyphen points of the token sequence, respectively. Obviously, $PP_{f_0 \ldots f_{k-1}} = PP_{v_0 \ldots v_{n-1}}$ and $P_{v_0 \ldots v_{n-1}} \supseteq P_{f_0 \ldots f_{k-1}}$. According to (1), the elements of their set difference $P_{v_0 \ldots v_{n-1}} - P_{f_0 \ldots f_{k-1}}$ are all impermissible. Thus, the points that remain to be examined in regard to their hyphen permissibility are elements of set $P_{f_0 \ldots f_{k-1}}$.</Paragraph> <Paragraph position="12"> This examination will be directed by the V3 and V4 prohibitive rules. To conclude, formal definitions of the diphthong and excessive diphthong sets would suffice. In this case, the specification of permissible hyphens would be based on whether each sequence of pairwise consecutive tokens is an element of one of these sets.</Paragraph> <Paragraph position="13"> Identification of diphthongs and excessive diphthongs is a difficult task because of the ambiguity that arises when attempting to make specific designations. There are extreme cases where sequences exist whose assignment as diphthongs is context dependent. Some instances remain ambiguous even within precise contexts. Specifically, they may or may not be labeled as diphthongs, depending on the specific dialect or on the personal preference of the native speaker. To deal with this problem formally we shall determine weaker boundaries of diphthongs and excessive diphthongs. 
When considering hyphenation in regard to diphthongs, the problem is that diphthong definitions are circular, as in Triantafillidis (1978, 33), who states that &quot;two vowels (footnote 7) that (Footnote 6: As a matter of fact, the only sequences that might be problematic in tokenization are in the set of expression {α | ε | η | ο}{υ | ύ}{ι | ί}. However, the algorithm ensures that the second vowel, υ or ύ, will be associated with the first. For example, the substring ουί in the word βεδουίνος [veðuínos] 'Bedouin' is separated as ου and ί and not as ο and υί.) (Footnote 7: Double-vowel blends are included in this excerpt.)</Paragraph> <Paragraph position="14"> examined and concrete hyphenation rules were derived. However, as Lemma 6(c) explicitly acknowledges, hyphenation is restricted by diphthongs and excessive diphthongs. In this section, we shall proceed to an empirical examination of diphthongs and excessive diphthongs. Taking into account the initial specification that the hyphenator should never generate non-acceptable hyphens, and in order to pare down the enormous sets of candidate diphthongs and excessive diphthongs, we need to isolate the subset of sequences for which splitting is always permitted.</Paragraph> <Paragraph position="15"> The approach followed was to first select all sequences of the above sets that were mentioned in various grammar books as examples of diphthongs and to assign them to the category of &quot;never-splitting sequences.&quot; Then experimental matches were conducted through an electronic dictionary of Modern Greek that encodes 100,000 lemmata and all their inflectional and derivational forms (Vagelatos et al. 1995), and a 13 Mbyte corpus of newspaper articles. By definition, this process could not be automatic because hyphens were not included in the lexicon or the corpus, but there were far too many matches to be examined manually. Manual examination was restricted to those matches having limited frequency of occurrence. 
Nevertheless, during this process, a systematic method of identifying additional nonsplitting sequences was discovered based on a rule for stressing that states that a stress mark can only be applied to the ultimate, penultimate, or antepenultimate position of a word. Words in the lexicon were hyphenated based on the assumption that all remaining candidates do split. This hyphenation, however, resulted in certain words whose stress appeared on a syllable to the left of the antepenultimate position. Apparently then, incorrect hyphenation had been applied. All diphthong and excessive diphthong candidates included in these words were collected and designated nonsplitting sequences.</Paragraph> <Paragraph position="16"> For the remaining candidates, identification of particular categories of substrings where a general exclusion rule may apply was attempted. Disparate and sometimes contradictory views given in various books (Setatos 1971; Triantafillidis 1978; Petrounias 1984; Mackridge 1987; Tsopanakis 1994) were collected. Their integrity was extensively examined through selection of matching words found in the corpus and in the lexicon. This empirical process resulted in formally expressed rules independent of any exceptions. The sets of categories found are not necessarily disjoint, whereas all overlaps always lead to consistent hyphenation. All categories found are explained below, and representative hyphenated examples along with IPA transcriptions and translations are given. In order to avoid confusion, hyphenation is applied to those vowel sequences corresponding to the category currently being explained, and not to the entire word. Formal definitions of all categories are given in Table 2.</Paragraph> <Paragraph position="17"> 1. Examination of excessive diphthong candidates showed that 50% are immediately eliminated, i.e., always split. 
Specifically, rule F4 states that candidate excessive diphthongs whose first part is stressed do always split, e.g., 7raL&E-c~ \[p~6i-a\] 'education', ~TopE-a \[istorf-a\] 'history', n~-~7~7/ \[~l-isi\] 'pregnancy', fl/c~o~ \[vf-eos\] 'violent', Ae~-o~ \[If-os\] 'smooth', TpoE-a \[trf-a\] 'Troy'. On the other hand, not all diphthong candidates whose second part is stressed split, but the candidates in this set that are not simultaneously excessive do always split (rule F5, Table 2).</Paragraph> <Paragraph position="18"> 2. Another category is associated with the existence of the diaeresis mark on a vowel of either a candidate diphthong or an excessive diphthong.</Paragraph> <Paragraph position="19"> All candidates whose second vowel has both a diaeresis mark and a stress mark do always split, e.g., Ma-~ov \[Ma-fu\] 'May', 7rpo-(c):rap~l \[pro-fparksi\] 'preexistence', e~a-(c)Awcr~/\[eksa-flosi\] 'immateriality'. In addition, all candidates having as first or second token a (c) always split, e.g., 6-(c)~o~ \[~-ilos\] 'immaterial', 7rpo-(c)rc6OecrT1 \[pro-ip6Oesi\] 'prerequistic', AaOpo(c)-aAovpv& \[la0roi-alurvia\] 'glass smuggling'. As well, candidates that have as a first part only a nonstressed ~&quot; always split. Formally, this category is defined by rules F6 and F7 (Table 2). Diaeresis marks were used as a discriminating factor for additional candidates. The single-stress system imposed on Modern Greek in the last decade, states that &quot;if the absence of the diaeresis mark does not generate ambiguity the mark should be eliminated&quot; (Mackridge 1987, 93). Theoretically, this simplification could be applied to a variety of vowel sequences, but examination shows that acceptable words containing such sequences do not always exist, and not all sequences split. 
We focus on four that always split, namely: w~ - ~a;-~o \[zo-ffio\] 'vermin', 6v - 6-v~oC/ \[~-ilos\] 'incorporeal, immaterial', tv - apxt-wr~p~rTlC/ \[arxi-ipir~tis\] 'butler', tC/ ~rept-(;flptC/~l \[peri-fvrisi\] 'insult' (rule F8, Table 2).</Paragraph> <Paragraph position="20"> 3. Another observation is that all diphthong candidates having an ov or oC/ as a second part, and whose first part is not in set I always split, e.g., nAai-ovaa \[klg-usa\] 'weeping willow', vra-o(;Ata \[da-tilia\] 'drums', #a-o(;vc~ \[ma-tina\] 'barge', wpcd-ovC/ \[org-us\] 'beautiful'. Furthermore, the category is expanded to include candidates whose second part is a double-vowel blend. At this point it should be stressed that there are specific examples where the candidate vowel sequence could linguistically be considered a diphthong based on pronunciation (Triantafillidis 1978, 19). However, during hyphenation they split de facto, e.g., ~rd-~L \[p~-i\] 'goes', c~-~L0c~A~C/ \[a-i0alfs\] 'evergreen' (rule F9, 4. Rule F~0 is associated with those candidate excessive diphthongs that have an ov or oC/ as a first part. Note that the stressed /u/has been Computational Linguistics Volume 23, Number 3 .</Paragraph> <Paragraph position="21"> .</Paragraph> <Paragraph position="22"> already included in F4 because of the stress mark. Detailed examination of the candidates of this category led to the conclusion that the candidates always split during hyphenation. Although sometimes they are pronounced as diphthongs, they are split de facto, e.g., q)eflpov-&pwg \[fevru-~irios\] 'February', flov-71~-6 \[vu-it6\] 'clamor', flov-E~cL \[vu-fzi\] 'it clamors', Be5ov-EvoC/ \[vc6u-fnos\] 'Bedouin', Ov-a)~&~ \[u-alfa\] 'Wales', o~nov-o#eTp\[o~ \[aku-ometrfa\] 'acoustic metrics'.</Paragraph> <Paragraph position="23"> An interesting subset of candidates concerns the intersection of candidate diphthong and excessive diphthong sets. 
This set is (I U U) x (I tJ U) and although it comprises a relatively great number of elements, most of these have low frequency of occurrence in linguistically acceptable words. It should be noted here that some parts of this set have already been covered by other rules. For the subset not covered, no general rule was formulated but particular instances that always split were identified.</Paragraph> <Paragraph position="24"> These instances are covered by rule Fll. We observed that some cases present ambiguity, while others always split e.g., &-~a~-d#cuo~' \[Si-ist~imcnos\] 'contrary', &-~aTc~#~ \[6i-fstam~\] 'I dissent', &-zlO~#guo~ \[6i-iOimgnos\] 'filtered', &-~preLpwT~n6g \[Si-ipirotik6s\] 'intercontinental', &-~O~la~l \[6i-f0isi\] 'filtering', &-~77/#c~ \[Si-fTima\] 'short story', #v-zl#guog \[mi-imdnos\] 'initiated', #v-~aeLC/ \[mi-fsis\] 'initiate', 7ro~-~-z~g \[pi-itfs\] 'poet', ~ro~-~ac~g \[pi-fsis\] 'you will do', o~w'o~v-c& \[aftofi-fs\] 'self-grown', CmTrAoTro~-C/& \[cpiplopi-fs\] 'furniture-makers', c~,~-eg~ \[ali-fa\] 'fishing', w-& \[i-6s\] 'son', w-oOC/ai~ \[i-o0C/sfa\] 'adoption', 6p~rw-c~ \[~irpi-a\] 'harpy'. There is a different rule for determining the splitting of excessive diphthongs, referred to by both Triantafillidis (1978, 38) and Tsopanakis (1994, 108). It concerns the natural semantics of excessive diphthongs; the avoidance of hiatus in the spoken language. If the flow of speech is constrained by the existence of additional &quot;difficult&quot; or complex phthongs, the pronunciation of the excessive diphthong in one syllable becomes impossible. One such case is that of at least a double-consonant sequence, whose second consonant is p \[r\] followed by a candidate excessive diphthong. That diphthong is not excessive and should always be split (rule F12, Table 2).</Paragraph> <Paragraph position="25"> It should be noted that additional rules covering additional vowel sequences under specific contexts have been found and examined. 
For example, candidate diphthongs located between the members of compound words prefixed by a preposition do not split. The automatic identification of these instances would be based on a morphological analysis of words, a process beyond the scope of the present analysis. study, the question of whether all sequences presented in the rules of Table 2 exist within acceptable Modern Greek words arises. Eliminations of consonant patterns exceeding a maximum length have already been discussed. Eliminations based on the existence of certain vowel sequences may be possible. However, ancient Greek words and borrowed foreign words that are frequently used in both written and spoken forms contain additional sequences and, as has already been mentioned, their hyphenation is governed by the same rules. Nevertheless, vowel sequences that contain consecutive stressed vowels or double-vowel blends, or consecutive vowels with diaeresis marks, do not exist in any word, pure Greek or loan, and thus this can be used as a general elimination principle. The patterns f1f2, f1, f2 ∈ V ∪ 2V ∪ VC, of Lemma 6 contain exactly 301 such sequences. From the remaining vowel sequences of Lemma 6, a few may be identified as non-existent. However, ad hoc compounds that can be readily created may contain even those sequences. Mackridge (1987) notes that, unlike in English, a person fluent in Greek has no difficulty in pronouncing an unknown word. This holds for all vowel sequences in Greek, independently of whether they exist within acceptable words. It was thus decided to examine all theoretically possible cases and not to eliminate any sequences a priori.</Paragraph> </Section> <Section position="2" start_page="372" end_page="372" type="sub_section"> <SectionTitle> 2.3 Degree of Hyphenation Completeness </SectionTitle> <Paragraph position="0"> The rules in Tables 1 and 2 guarantee 100% correct hyphenation. 
The rules in Table 1 are capable of locating all permissible hyphenations of consonant sequences. In regard to vowel sequences, the set (V ∪ 2V ∪ VC) has 34 elements and, according to Lemma 6, complete hyphenation of vowel sequences depends on 34² = 1,156 vowel sequences.</Paragraph> <Paragraph position="1"> Grammar rules V1 and V2 explicitly define 16 of these, namely the elements of sets 2V and VC, while grammar books refer to 8 diphthongs that never split. Hence, only 16 + 8 = 24 sequences were initially non-ambiguous, while 1,156 - 24 = 1,132 were ambiguous. Rules F1-F11 (Table 2) resolve the ambiguity of 1,015 different patterns.</Paragraph> <Paragraph position="2"> (Occurrences of overlapping patterns have been eliminated by analytically calculating the intersection of the sets of patterns for all pairs of rules F1-F11.) In total, 1,156 - 24 - 1,015 = 117 sequences remain ambiguous. Thus, these rules are capable of completely hyphenating at least ((1,015 + 24)/1,156) × 100 = 89.9% of the 1,156 sequences.</Paragraph> <Paragraph position="3"> (If non-existent patterns were eliminated, i.e., those consisting of either two consecutive stressed vowels or of two consecutive vowels with diaeresis marks, the degree of completeness of the hyphenator on a vowel pattern basis could then be computed as ((1,029 - 301)/(1,156 - 301)) × 100 = 85.2%.) Taking into account rule F12, which resolves ambiguity by proposing additional hyphen points under specific contexts, the degree of completeness increases. Furthermore, the ambiguity of additional sequences can be resolved without proposing additional hyphens, by using the rule stating that stress cannot be applied to a syllable beyond the antepenultimate position.</Paragraph> <Paragraph position="4"> The degree of completeness calculated above does not represent completeness in terms of hyphenated words of real text corpora. 
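As a quick check of the pattern-level figures, the arithmetic can be reproduced from the counts quoted in the text. The following Python lines are only an illustrative sketch: the variable names are hypothetical, and the counts are copied from the text rather than recomputed from Tables 1 and 2.

```python
# Counts quoted in the text (assumed as given, not derived from the tables).
TOKENS = 34            # size of the set V, 2V and VC taken together
UNAMBIGUOUS = 16 + 8   # rules V1/V2 plus the 8 diphthongs that never split
RESOLVED = 1015        # patterns whose ambiguity rules F1-F11 resolve

total = TOKENS ** 2                               # all ordered pairs f1 f2
initially_ambiguous = total - UNAMBIGUOUS
still_ambiguous = initially_ambiguous - RESOLVED
completeness = 100 * (RESOLVED + UNAMBIGUOUS) / total

print(total, initially_ambiguous, still_ambiguous, round(completeness, 1))
```

Run as written, this prints 1156 1132 117 89.9, matching the figures for rules F1-F11.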
The proportion of completely hyphenated words in newspaper texts was manually calculated to be over 99%. This was expected, because the frequency of occurrence of the remaining ambiguous vowel sequences in words of real texts is relatively low.</Paragraph> </Section> </Section> <Section position="5" start_page="372" end_page="374" type="metho"> <SectionTitle> 3. Implementation </SectionTitle> <Paragraph position="0"> In the previous sections, hyphenation issues were examined as they pertain to Modern Greek with the goal of achieving machine hyphenation that is both accurate and complete to the highest degree possible.</Paragraph> <Paragraph position="1"> Existing hyphenators for Greek are commercial products and usually work on a minimal basis, i.e., finding the hyphen points of consonant sequences and, in limited cases, hyphens of vowel sequences. A research-based version of the Greek TeX typesetting system (Knuth 1986) provides improved hyphenation, but it only indicates splitting for 7.1% of the vowel sequences, which seem to have been selected rather intuitively. Furthermore, three of the sequences, as was observed, can generate impermissible hyphens.</Paragraph> <Paragraph position="2"> The rules presented here have been used for the development of a hyphenator program included in Microsoft Word for Windows 6.0 and 7.0 (Greek version), already on the market. The system has also been ported to different platforms, including Lotus AmiPro and a specialized typesetting system of a major Greek newspaper. The formal rules and the exact definitions of the sets of vowel and consonant sequences compiled in Tables 1 and 2 are sufficient to implement the hyphenator program. Patterns in Table 2 constitute maximal vowel tokens, which can be derived by a lexical analysis process, while patterns in Table 1 consist of single vowels and consonants. The hyphenator program comprises two parts: the lexical analyzer and the actual hyphenator. 
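A minimal Python sketch of this two-part structure is given here under simplifying assumptions: every maximal vowel run is treated as a single token (the distinction between V, 2V, and VC tokens is not modelled), only consonant rules C1-C3 of Section 2.1 are applied, the input is assumed to be lowercase with precomposed accented letters, and the CC set shown is a small illustrative subset rather than the set defined in Table 1. The names lex, hyphen_points, and hyphenate are hypothetical and are not taken from the actual program.

```python
VOWELS = set("αεηιουωάέήίόύώϊϋΐΰ")

# Illustrative subset only; Table 1 of the paper defines the full CC set.
CC = {"βλ", "βρ", "γλ", "γρ", "δρ", "θρ", "κλ", "κρ", "κτ", "πλ",
      "πρ", "πτ", "σκ", "σπ", "στ", "τρ", "φρ", "φτ", "χρ", "χτ"}


def lex(word):
    """Split word into maximal vowel ('V') and consonant ('C') runs.

    Each token is (kind, start, text), where start is the absolute
    position of the token in the input word.
    """
    tokens = []
    for pos, ch in enumerate(word):
        kind = "V" if ch in VOWELS else "C"
        if tokens and tokens[-1][0] == kind:
            k, start, text = tokens[-1]
            tokens[-1] = (k, start, text + ch)   # extend the current run
        else:
            tokens.append((kind, pos, ch))       # start a new run
    return tokens


def hyphen_points(word):
    """Hyphen positions from rules C1-C3 (vowel-vowel rules are omitted)."""
    tokens = lex(word)
    points = []
    # Runs alternate, so an interior consonant token lies between two vowels.
    for kind, start, text in tokens[1:-1]:
        if kind != "C":
            continue
        if len(text) == 1 or text[:2] in CC:
            points.append(start)       # whole sequence joins the next vowel
        else:
            points.append(start + 1)   # first consonant stays with previous vowel
    return points


def hyphenate(word):
    """Insert a hyphen at every point proposed by hyphen_points."""
    parts, prev = [], 0
    for p in hyphen_points(word):
        parts.append(word[prev:p])
        prev = p
    parts.append(word[prev:])
    return "-".join(parts)
```

Under these assumptions, hyphenate("άνθρωπος") yields άν-θρω-πος and hyphenate("αστέρι") yields α-στέ-ρι. Hyphen points inside vowel runs, which the rules of Table 2 govern, are not proposed by this sketch.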
The lexical analyzer reads the input characters and produces as output a sequence of maximal V, 2V, and VC tokens, as well as tokens of the maximal consonant sequences of the word. For all tokens, the absolute starting position of the token in the input word is maintained, while the length of each token is implicitly defined by the token itself. All consonant tokens are also subdivided according to whether their two-character prefix is contained in the CC set or not. Nontrivial consonant sequences are also designated by a flag indicating the occurrence of a ρ [r] suffix.</Paragraph> <Paragraph position="3"> Vowel tokens are further classified according to the neighboring vowel and consonant tokens. No additional classification of vowel tokens is needed in the following cases: (i) vowel tokens not in the I ∪ U set; (ii) vowel tokens that appear between any consonant sequences; (iii) stressed vowel tokens in the I ∪ U set that have as a left neighbor a consonant sequence with an /r/ suffix; (iv) vowel tokens that simultaneously have stress and diaeresis marks. The remaining vowel tokens are characterized explicitly as stressed I, nonstressed I, and U.</Paragraph> <Paragraph position="4"> The actual hyphenation phase follows, where the hyphenator traverses the token sequence, identifies all ordered sequences of type (a) |vowel token| - |consonant token| - |vowel token|, and (b) |vowel token| - |vowel token|, and applies the corresponding hyphenation rules. The resulting hyphen points are given in terms of the absolute starting position in the word of the first or the second token of the sequence currently being examined.</Paragraph> <Section position="1" start_page="373" end_page="374" type="sub_section"> <SectionTitle> 3.1 Hyphenation of Words in Uppercase </SectionTitle> <Paragraph position="0"> There is no one-to-one correspondence between uppercase and lowercase letters. 
The main difference is that stress markings are not applied to words whose letters are all written in capitals, while the diaeresis mark is maintained in capital letters. 11 Consequently, the transformation of any uppercase word to lowercase and back to uppercase again loses no information. The opposite transformation is not always without loss of information. To decrease the complexity of the hyphenator, we used only lowercase patterns. Thus, uppercase words are transformed to lowercase, hyphenated, and transformed back to uppercase forms. 12 Hyphenation patterns of consonant sequences (Table 1) are unchanged because consonants do not take stress marks and, moreover, the vowels contained in these patterns are independent of stress. On the other hand, many of the patterns derived for the hyphenation of vowel sequences cannot be applied to capitalized words because the most important discriminating factor in diphthong identification is stress marking, and uppercase letters (Section 2.2.1) lack stress markings. This implies that words in uppercase tend to have fewer hyphens than their lowercase equivalents. This inconsistency cannot be resolved without additional information about the position of the stress mark.</Paragraph> <Paragraph position="1"> 11 In words written with both capital and lowercase letters, an initial capital letter may have a stress mark. 12 The transformation takes into account the existence of a final [s] in the uppercase word and transforms it to the final ς instead of σ, according to a corresponding transformation rule.</Paragraph> </Section> </Section> <Section position="6" start_page="374" end_page="374" type="metho"> <SectionTitle> 4. 
Discussion </SectionTitle> <Paragraph position="0"> Overall, it was feasible to make an analytical examination of the hyphenating system mainly because most of the known hyphenation properties were expressed or could be expressed in terms of orthographic representation. In Greek, this representation contains much of the pronunciation information, which is the ultimate basis for hyphenation in every language. When analytical work reached the point where the available data could no longer provide the necessary pronunciation information, it was replaced by empirical work.</Paragraph> <Paragraph position="1"> A similar process would be difficult to conceive in languages in which the orthography and pronunciation are significantly different. It should perhaps be stated that the system itself may not have the capacity to be generalized to other languages.</Paragraph> <Paragraph position="2"> It is interesting to note that rules governing the splitting of subword patterns exist in languages such as English, but their application is usually determined by orthographically inexplicit information, such as the existence of a long, short, or stressed vowel in some position of the pattern. Different types of properties typical of such languages as English and German are based on morphological considerations that were not an issue for our system. For example, in English &quot;common roots&quot; is an issue in hyphenation of compounds, whereas in Greek, it is not. Such properties are not likely to be similarly expressed in a pattern-based model. The process of developing a similarly performing hyphenator for such languages would be different. Identification of certain patterns would presumably be based on an empirical rather than an analytical process. Automatic extraction of common hyphenating properties from on-line hyphenated dictionaries is known (Liang 1983). The resulting patterns tend to be more detailed and extended. 
Lists of exceptions seem to be obligatory in such an approach because their lack would lead to the generation of impermissible hyphens.</Paragraph> </Section> </Paper>