SYLLABLE-BASED MODEL FOR THE KOREAN MORPHOLOGY
Seung-Shik Kang
Dept. of Computer Science & Statistics
Hansung University
Seoul 136-792, Korea
Yung Taek Kim
Dept. of Computer Engineering
Seoul National University
Seoul 151-742, Korea
Abstract
This paper describes a syllable-based
computational model for the Korean
morphology. In this model, morphological
analysis is considered as a process of
candidate generation and candidate selection.
In order to increase the performance of the
system, the number of candidates is greatly
reduced and the system requires only a small
number of dictionary accesses. Idiosyncratic
features of a syllable, formalized as a
characteristic function, make it possible to
reject implausible candidates before dictionary
confirmation. Instead of a letter, the syllable
is the basic processing unit for the practical
implementation of the morphological analyzer.
1. Introduction
There are two linguistic phenomena that
are of interest in computational morphology:
morphological transformation and morpheme
identification. The two-level model and
the syllable-based formalism focused on the
problem of morphological transformation
[Bear88, Cahi90, Kosk83]. Morpheme
identification is an important issue in some
languages where two or more morphemes are
combined to make a word, a compound word,
or a sentence without any delimiters between
morphemes [Abe86, Chen92, Pach92].
The goal of morphological analysis is to
find the base form of morphemes in a word.
It consists of the generation of analysis
candidates and the selection of correct
candidates. Analysis candidates are generated
by reversing the word formation rules:
morpheme isolation and morphological
transformation. Then, correct candidates are
selected by the coherence restrictions among
adjacent morphemes and dictionary
confirmation. The morphological analyzer tries
to generate all the possible candidates only to
accept the correct candidates.
2. The Problem
The two-level model is widely known to be a
computationally efficient method for a
practical system on the condition that the
number of rules is small [Bart86, Kosk88].
However, when the size of the rulebase is
large it causes an exponential problem. In
the case of the Korean language, it is common
that a stem is followed by grammatical
morphemes. If we use the two-level model
for a practical system, a small set of
phonological rules and a large set of
morpheme isolation rules are required because
there are several thousand combinations of
grammatical morphemes [Zhan90].
In order to solve the problem, we can try
a 2-pass algorithm. All the possible
morphemes are isolated first, and then
phonological processing is done. It is also
possible to do the phonological processing
first and isolate morphemes at the second pass.
However, this kind of solution causes another
serious problem that arises from the
conditional restrictions: (1) some
morphological transformation occurs not only
at a stem but also at a functional morpheme,
(2) there are cooccurrence restrictions
between two morphemes, (3) morphological
transformation occurs only for a special
word group.
3. Syllable-based writing system
The writing system for most languages is
based on a letter set called an alphabet.
Instead of a letter set, the Chinese writing
system is based on a set of characters, each
of which consists of one or more letters. Each
character is a meaning unit and words are
represented by combinations of characters.
In the case of Korean, words are represented by
one or more characters as in Chinese. The
difference is that a Korean character is a
well-formed written syllable, which is a
sound unit rather than a meaning unit as in
Chinese. A written syllable is a combination
of two or three sound symbols, and it
corresponds to a spoken syllable in a
one-to-one fashion [Chun90]. Korean words
are constructed as follows based on the
syllable unit.
word ::= { syllable }+
syllable ::= open_syll | closed_syll
open_syll ::= initial + medial
closed_syll ::= initial + medial + final
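The grammar can be tried out on modern text because precomposed Hangul syllables in Unicode are laid out by exactly these three components. The following is a minimal sketch, not part of the original system: it assumes Unicode input (the block starting at U+AC00), and the names `decompose` and `is_closed` are illustrative.

```python
HANGUL_BASE = 0xAC00           # first precomposed Hangul syllable in Unicode
N_MEDIAL, N_FINAL = 21, 28     # 28 = 27 final consonants + "no final"
N_SYLLABLES = 19 * N_MEDIAL * N_FINAL


def decompose(syllable: str) -> tuple[int, int, int]:
    """Return (initial, medial, final) indices; final == 0 means no final."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < N_SYLLABLES:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    initial, rest = divmod(offset, N_MEDIAL * N_FINAL)
    medial, final = divmod(rest, N_FINAL)
    return initial, medial, final


def is_closed(syllable: str) -> bool:
    """closed_syll ::= initial + medial + final"""
    return decompose(syllable)[2] != 0


print(decompose("한"))   # (18, 0, 4)
print(is_closed("가"))   # False
```

Working on component indices rather than on letters keeps the syllable as the processing unit while still giving access to its parts.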
4. Idiosyncratic features of syllable
There are 11,172 syllables in the modern
Korean language (= 19 initials * 21 medials *
28 finals, i.e. 27 finals plus one for null).
However, it is interesting to investigate the
usage of syllables to make a word. About 2,350
syllables cover more than 99.9% of the
modern Korean words. Furthermore, 267
syllables (11.36% of 2,350 syllables) are only
used for the surface form of verbs, and
grammatical morphemes are combinations of
151 syllables (6.43% of 2,350 syllables). In
addition, only a very small set of syllables, 1
to 46 syllables for each type of irregular
verbs, are tied to the morphological
transformation [Kang93]. This kind of
information is very useful to improve the
efficiency of the morphological analyzer. For
example, if a syllable used only for the
surface form of a verb is found in a word, we
can easily guess that the word is a verb, the
string before that syllable is a stem, and the
rest is a grammatical morpheme. No other
result is possible except in the case of
typographic errors.
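The figures quoted above can be checked with a few lines of arithmetic. This is only a sanity check of the numbers stated in the text; the helper name `share` is illustrative.

```python
TOTAL = 19 * 21 * (27 + 1)   # initials * medials * (finals + null)
COMMON = 2350                # syllables covering more than 99.9% of words


def share(count: int, total: int = COMMON) -> float:
    """Percentage of the common syllable inventory, rounded to 2 places."""
    return round(100 * count / total, 2)


print(TOTAL)        # 11172
print(share(267))   # 11.36  (syllables used only in verb surface forms)
print(share(151))   # 6.43   (syllables used in grammatical morphemes)
```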
Suppose that X is a set of syllables that
are used at the first position of grammatical
morphemes. We can easily guess the syllable
boundary position of a grammatical morpheme
in an n-syllable word at syllable $x_j$, where
$x_j \in X$ and $1 \le j \le n$. There is no
possibility at other positions. It is based on
the fact that only 48 syllables are used for
the first position of postpositions and 72
syllables for the first position of final endings
in the Korean language.
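A minimal sketch of this filtering step follows. The function name `boundary_candidates` and the three-element set are illustrative assumptions; the real set X has 48 + 72 entries that are not listed in this paper.

```python
def boundary_candidates(word: str, first_syllables: set[str]) -> list[int]:
    """1-based positions j where word[j] may start a grammatical morpheme."""
    return [j for j, syl in enumerate(word, start=1) if syl in first_syllables]


# Hypothetical miniature X, for illustration only.
X = {"는", "을", "에"}
print(boundary_candidates("학교에서는", X))   # [3, 5]
```

Only the returned positions need to be tried as morpheme boundaries; every other split is rejected without a dictionary access.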
Three kinds of syllable features are
defined according to where the features are
extracted. 'Unit feature' is a syllable feature
defined on the syllable itself. If a syllable
$x_i$ itself has an idiosyncratic feature $f_i$,
then $x_i$ has a unit feature $f_i$. 'Partial
feature' is defined by a component of a
syllable. A syllable $x_i$ is said to have a
partial feature $p_i$ if $x_i$ includes a
component $p_i$ as an initial, a medial, or a
final letter. 'Successive feature' is a
meta-level feature defined for two adjacent
syllable features. For example, if there is a
set of two successive syllables $x_i x_{i+1}$
that construct grammatical morphemes
and that cannot construct any noun/verb,
then the boundary position of a grammatical
morpheme is possible only at syllable $x_i$ or
$x_{i+1}$.
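The three kinds of features can be pictured as three different lookups. The sketch below is one possible encoding under stated assumptions: syllables are precomposed Unicode Hangul (so 588 = 21 * 28 and 28 are the Unicode layout constants), and all function names and the sample pair set are hypothetical.

```python
def has_unit_feature(syl: str, feature_set: set[str]) -> bool:
    """Unit feature: defined on the syllable itself (set membership)."""
    return syl in feature_set


def has_partial_feature(syl: str, slot: str, index: int) -> bool:
    """Partial feature: the syllable contains a given component."""
    offset = ord(syl) - 0xAC00
    parts = {
        "initial": offset // 588,
        "medial": offset % 588 // 28,
        "final": offset % 28,          # 0 means "no final"
    }
    return parts[slot] == index


def successive_boundaries(word: str, pairs: set[str]) -> list[int]:
    """Successive feature: for each adjacent pair x_i x_{i+1} found in
    `pairs`, the boundary may only be at position i or i+1 (1-based)."""
    positions = set()
    for i in range(len(word) - 1):
        if word[i:i + 2] in pairs:
            positions.update({i + 1, i + 2})
    return sorted(positions)


print(has_partial_feature("한", "final", 4))          # True
print(successive_boundaries("학교에서는", {"에서"}))   # [3, 4]
```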
5. Characteristic function
Idiosyncratic features of syllables are
represented using a characteristic set of
syllables. Suppose that a part of speech (i),
morpheme length (j), and the position of a
syllable in a word (k) are discriminating
features of a characteristic set. Let $P_i$ be
the set of syllables that are used for part of
speech i, $Q_j$ be the set of syllables that are
used for morpheme length j, and $R_k$ be the
set of syllables that are used at the k-th
position of a syllable in the word. Then, a
characteristic set of syllables
$A_{\langle i,j,k \rangle}$ is the
intersection of $P_i$, $Q_j$, and $R_k$.
$A_{\langle i,j,k \rangle} = P_i \cap Q_j \cap R_k$
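For the characteristic set of syllables
$A_{\langle i,j,k \rangle}$, the characteristic function
$C_{A_{\langle i,j,k \rangle}}$ is defined from
$A_{\langle i,j,k \rangle}$, with values in $\{0,1\}$.
[Definition] characteristic function
Let X be a set of Korean syllables and
$A_{\langle i,j,k \rangle}$ be a characteristic set of syllables
where $A_{\langle i,j,k \rangle} \subseteq X$ for part of speech i,
morpheme length j, and the k-th position of
morpheme. Define the function
$C_{A_{\langle i,j,k \rangle}} : X \to \{0, 1\}$
$C_{A_{\langle i,j,k \rangle}}(x) = \begin{cases} 1, & \text{if } x \in A_{\langle i,j,k \rangle} \\ 0, & \text{otherwise} \end{cases}$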
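In code, a characteristic set is a plain set intersection and its characteristic function is a membership test. The sketch below uses illustrative names (`by_pos`, `by_length`, `by_position`) for the three sets keyed by part of speech, morpheme length and position; the tiny sets are invented for the example.

```python
def characteristic_set(by_pos: set[str], by_length: set[str],
                       by_position: set[str]) -> frozenset[str]:
    """Intersection of the three discriminating syllable sets."""
    return frozenset(by_pos & by_length & by_position)


def characteristic_function(char_set: frozenset[str]):
    """Return C_A: syllable -> 1 if the syllable is in A, else 0."""
    return lambda x: 1 if x in char_set else 0


# Hypothetical toy sets, for illustration only.
A = characteristic_set({"는", "을", "다"}, {"는", "을"}, {"는", "다"})
C_A = characteristic_function(A)
print(sorted(A), C_A("는"), C_A("을"))   # ['는'] 1 0
```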
Many characteristic functions are
possible depending on the arguments i, j, and k.
However, only some of them are chosen for
morpheme isolation or morphological
transformation, and they are reorganized as
a syllable information function (f) in order to
find out the characteristics of a specific
syllable. The value of f(x) on a syllable x is
defined by the characteristic functions
$C_{A_{\langle i,j,k \rangle}}(x)$. Suppose that $\alpha$ is the number of
parts of speech and $\beta$ is the maximum number
of syllables in a word. Then a triple $A_{\langle i,j,k \rangle}$
can be transformed into $A_t$ by the following
expression.
$t = (k-1) \cdot \alpha \cdot \beta + (j-1) \cdot \alpha + i$
$(1 \le i \le \alpha,\ 1 \le j \le \beta,\ 1 \le k \le \beta)$
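The expression is a mixed-radix encoding, so every triple (i, j, k) within the stated bounds gets a distinct index t between 1 and α·β². A small sketch of the mapping and its inverse follows; the function names and the sample values α = 5, β = 3 are assumptions made for illustration.

```python
def triple_to_index(i: int, j: int, k: int, alpha: int, beta: int) -> int:
    """t = (k-1)*alpha*beta + (j-1)*alpha + i, with 1-based i, j, k."""
    if not (1 <= i <= alpha and 1 <= j <= beta and 1 <= k <= beta):
        raise ValueError("index out of range")
    return (k - 1) * alpha * beta + (j - 1) * alpha + i


def index_to_triple(t: int, alpha: int, beta: int) -> tuple[int, int, int]:
    """Inverse mapping: recover (i, j, k) from t."""
    k, rest = divmod(t - 1, alpha * beta)
    j, i = divmod(rest, alpha)
    return i + 1, j + 1, k + 1


print(triple_to_index(1, 1, 1, alpha=5, beta=3))   # 1
print(triple_to_index(5, 3, 3, alpha=5, beta=3))   # 45
print(index_to_triple(45, alpha=5, beta=3))        # (5, 3, 3)
```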
Let g be a function from a set of syllables to
a Cartesian product of characteristic functions
and h be a function from a Cartesian product
of characteristic functions to an integer.
Then, functions g and h are defined as
follows.
$g : X \to C_{A_1} \times C_{A_2} \times \cdots \times C_{A_n}$
$g(x) = (C_{A_1}(x), C_{A_2}(x), \ldots, C_{A_n}(x))$
$h : C_{A_1} \times C_{A_2} \times \cdots \times C_{A_n} \to \mathbb{N}$
$h(C_{A_1}(x), C_{A_2}(x), \ldots, C_{A_n}(x)) = \sum_{i=1}^{n} C_{A_i}(x) \cdot W(i)$, where $W(i) = 2^{i-1}$
Now, the syllable information function f is defined
as the composition of h and g. The domain of the
function f is a set of syllables and the range
is a bit string of an integer where bit position t
is used for a specific feature and the value
of the t-th bit indicates whether the syllable
has the corresponding feature or not.
$f : X \to \mathbb{N}$
$f(x) = \sum_{i=1}^{n} C_{A_i}(x) \cdot W(i)$, where $W(i) = 2^{i-1}$
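Because the weights are powers of two, f packs all feature tests for a syllable into one integer that can be precomputed and looked up. A minimal sketch, with hypothetical helper names and invented toy sets:

```python
def make_info_function(char_sets: list[set[str]]):
    """char_sets[t-1] is the characteristic set A_t; feature t has
    weight W(t) = 2**(t-1), i.e. it occupies bit t-1 of the result."""
    def f(x: str) -> int:
        return sum(1 << t for t, a_t in enumerate(char_sets) if x in a_t)
    return f


def has_feature(value: int, t: int) -> bool:
    """Test the t-th feature bit (1-based) of a value returned by f."""
    return bool(value >> (t - 1) & 1)


# Hypothetical characteristic sets A_1, A_2, A_3.
f = make_info_function([{"는", "을"}, {"다"}, {"는", "다"}])
print(f("는"), f("다"), f("가"))   # 5 6 0
print(has_feature(f("는"), 3))     # True
```

A single table lookup followed by a bit test then answers any feature query for a syllable.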
6. Syllable-based formalism
The morphological analysis system is
formalized as a function F. The domain of
function F is a set of words and the range
of F is a Cartesian product of a set of
morphemes and their morpho-syntactic
features.
y = F(x)
F : W → W'
W : a set of words
W' = M × F
M : a set of morphemes
F : a set of morpho-syntactic features
Suppose that $m_i$ is a root form of a lexical
morpheme, $f_i$ is a combination of features and
$r_k$ is a two-level rule. Then, function F is
defined as follows. Function p checks the
condition of two-level rules. Function q
generates a combination of morpho-syntactic
features of a word.
$F(\mathit{word}) = \begin{cases} \text{a set of } (m_i, f_i), & \text{if } m_i = p(\mathit{word}, r_k) \text{ and } f_i = q(\mathit{word}) \\ \emptyset, & \text{otherwise} \end{cases}$
Some morpho-syntactic features are
defined for the morphological analysis. Parts
of speech, irregular types and other features
are defined as follows.
POS = { N, V, ADJ, ADV, DET, ... }
irtype = { B, D, G, H, L, N, R, S, U }
prefix = { prefix-1, prefix-2, ..., prefix-n }
suffix = { suffix-1, suffix-2, ..., suffix-n }
pres, past, fut, imp, hon, ... = { +, - }
A syllable-based rule consists of a left-hand
side (LHS) and a right-hand side (RHS). They
are described by the following primitive
functions.
syllable(word, i)
subsyl(word, i, j)
$C_{A_{\langle i,j,k \rangle}}(x)$
irreg_type(word)
initial(x), medial(x), final(x)
noun(word), verb(word), adv(word),
det(word), impr(word)
change(x, y, z, INITIAL/MEDIAL/FINAL)
insert(x, word, i):
insert syllable x at i-th position
delete(word, i): delete i-th syllable
'syllable(word,i)' fetches the i-th syllable of
word and 'subsyl(word,i,j)' gets j
syllables starting from the i-th syllable of word.
$C_{A_{\langle i,j,k \rangle}}$ checks whether a syllable x
belongs to a characteristic set of syllables or
not. For example, the b-irregular rule in Korean
is described as follows. Set '$A_T$' is supposed
to be a characteristic set of the last syllables
of b-irregular verbs.
$C_{A_T}(s_{i-1}) = 1$,
head <- subsyl(word, 1, i-1),
change(head[i-1], null, 'p(ㅂ)', FINAL),
verb(head) <- IRREG_B
tail <- subsyl(word, i, n-i+1),
change(tail[1], 'we(ㅝ)', 'e(ㅓ)', MEDIAL)
The b-irregular rule is described in the
syllable-based formalism and it is applied
after the isolation of stem parts. So, stem
and ending candidates should be identified
first.
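To make the b-irregular rule concrete, the sketch below undoes it on Unicode Hangul: the final ㅂ is restored on the last stem syllable and the medial ㅝ of the next syllable is changed back to ㅓ. It is an illustration under stated assumptions, not the authors' implementation: the component indices are those of the Unicode layout, the one-element set standing in for the characteristic set is hypothetical, and the dictionary confirmation of the stem is left out.

```python
BASE, N_MEDIAL, N_FINAL = 0xAC00, 21, 28
FINAL_P, MEDIAL_WE, MEDIAL_E = 17, 14, 4   # Unicode indices of ㅂ, ㅝ, ㅓ


def split(syl: str) -> tuple[int, int, int]:
    """Decompose a precomposed syllable into (initial, medial, final)."""
    initial, rest = divmod(ord(syl) - BASE, N_MEDIAL * N_FINAL)
    medial, final = divmod(rest, N_FINAL)
    return initial, medial, final


def join(initial: int, medial: int, final: int) -> str:
    """Compose a syllable from component indices."""
    return chr(BASE + (initial * N_MEDIAL + medial) * N_FINAL + final)


def undo_b_irregular(word: str, i: int, a_t: set[str]):
    """Apply the rule at 1-based position i; return (stem, ending) or None."""
    if i < 2 or i > len(word):
        return None
    prev, cur = word[i - 2], word[i - 1]
    p_init, p_med, p_fin = split(prev)
    c_init, c_med, c_fin = split(cur)
    if prev not in a_t or p_fin != 0 or c_med != MEDIAL_WE:
        return None
    # change(head[i-1], null, ㅂ, FINAL)
    head = word[:i - 2] + join(p_init, p_med, FINAL_P)
    # change(tail[1], ㅝ, ㅓ, MEDIAL)
    tail = join(c_init, MEDIAL_E, c_fin) + word[i:]
    return head, tail


# Hypothetical one-element characteristic set, for illustration only.
print(undo_b_irregular("고마워", 3, {"마"}))   # ('고맙', '어')
```

The returned stem candidate would still have to be confirmed as a b-irregular verb in the dictionary.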
[Figure: input word -> MORPHEME BOUNDARY -> MORPH. ALTERNATION -> DICTIONARY ACCESS -> analysis result]
Fig. morphological analysis
An overall view of the morphological analyzer
is shown in the figure. The first step is to
find the morpheme boundaries using
characteristic functions for syllables. Stem
candidates are generated at the second step
by the phonological rules. Phonological rules
are applied at a syllable w[i] if and only
if w[i-1] is an element of a required
characteristic set, and w[i+1] is the beginning
syllable of another morpheme.
The following algorithm guesses the
beginning position of a grammatical morpheme.
In the algorithm, GM_SET1 and GM_SET2
are characteristic sets for the first and the
remaining syllables of grammatical morphemes,
respectively.
algorithm boundary_syllable(word)
syllable word[]; /* input word */
begin
    n = nsyl(word);
    for (i = 1; i < n; i = i+1) {
        if (word[i] ∈ GM_SET1) {
            if (word[i+1] ∈ GM_SET2)
                return(i);
        }
    }
    return(n);
end
Algorithm. morpheme boundary
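The algorithm translates almost line for line into a runnable form. The Python sketch below follows one reading of the pseudocode (the second test is taken to be set membership) and uses small hypothetical sets in place of the real GM_SET1 and GM_SET2.

```python
def boundary_syllable(word: str, gm_set1: set[str], gm_set2: set[str]) -> int:
    """Return the 1-based position i of the first syllable that is in
    gm_set1 and is followed by a syllable in gm_set2; n if none is found."""
    n = len(word)
    for i in range(1, n):                       # i = 1 .. n-1
        if word[i - 1] in gm_set1 and word[i] in gm_set2:
            return i
    return n


# Hypothetical miniature sets, for illustration only.
GM_SET1 = {"에", "는"}
GM_SET2 = {"서", "는"}
print(boundary_syllable("학교에서는", GM_SET1, GM_SET2))   # 3
print(boundary_syllable("학교", GM_SET1, GM_SET2))         # 2
```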
7. Evaluation of the model
There are two types of candidates for a
word. The first type is generated by
morpheme isolation at all the syllable
boundaries and the second type is generated
for each morpheme candidate by the
phonological rules. We can count the number
of candidates as follows. Suppose that $\alpha$ is
the maximum number of syllables that cause
an inflexion, $\beta$ is the number of candidates
for prefinal endings, and $\gamma$ is the maximum
number of inflexions for one syllable. In the
case of Korean, $\alpha$ is less than n, $\beta$ is 2,
and $\gamma$ is 3. If a word
consists of n syllables, then the maximum
number of candidates is $10n + 8\alpha + 2$.
- candidates for 1-morpheme word
and (noun + postposition)
(1) 1-morpheme word: 1
(2) noun + postposition: $n-1$
(3) noun + suffix + postposition: $n-2$
- candidates for irregular verbs and
(verb + ending)
(4) verb + ending: $n-1+\alpha$
(5) verb + prefinal_ending + ending: $\beta$
(6) verb inflexion: $\gamma(n-1+\alpha+\beta)$
(7) verb + suffix + ending:
$(n-2+\alpha+\beta) + \gamma(n-2+\alpha+\beta)$
$C(n) = (1) + (2) + (3) + (4) + (5) + (6) + (7)$
$= 1 + (n-1) + (n-2) + (n-1+\alpha) + \beta + \gamma(n-1+\alpha+\beta) + (n-2+\alpha+\beta) + \gamma(n-2+\alpha+\beta)$
$= (4+2\gamma)n + (2\alpha + 2\alpha\gamma + 2\beta + 2\beta\gamma - 3\gamma - 5)$
$= 10n + 8\alpha + 2$ (with $\beta = 2$, $\gamma = 3$)
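The closed form can be checked mechanically by summing the seven terms for a range of word lengths. The function names below are illustrative; the parameter defaults are the values the text gives for Korean (2 prefinal-ending candidates, at most 3 inflexions per syllable).

```python
def candidates(n: int, alpha: int, beta: int = 2, gamma: int = 3) -> int:
    """Sum of the seven candidate counts for an n-syllable word."""
    terms = [
        1,                                    # (1) 1-morpheme word
        n - 1,                                # (2) noun + postposition
        n - 2,                                # (3) noun + suffix + postposition
        n - 1 + alpha,                        # (4) verb + ending
        beta,                                 # (5) verb + prefinal_ending + ending
        gamma * (n - 1 + alpha + beta),       # (6) verb inflexion
        (n - 2 + alpha + beta)                # (7) verb + suffix + ending
        + gamma * (n - 2 + alpha + beta),
    ]
    return sum(terms)


def closed_form(n: int, alpha: int) -> int:
    """10n + 8*alpha + 2, valid for beta = 2 and gamma = 3."""
    return 10 * n + 8 * alpha + 2


print(candidates(3, 1), closed_form(3, 1))   # 40 40
```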
It is very inefficient to look up the
dictionary for all the implausible stems and
grammatical morphemes. Only plausible
candidates are generated using the
idiosyncratic features of syllables. Now, the
maximum number of candidates is bounded by
a constant and the number of dictionary
accesses is highly reduced.
(1) 1-morpheme word: 1
(2) noun + postposition: 2
(3) noun + suffix + postposition: 2
(4) verb + ending: 2
(5) verb + prefinal_ending + ending: $2\beta$
(6) verb inflexion: $\gamma(2+2\beta)$
(7) verb + suffix + ending: $(2+2\beta) + \gamma(2+2\beta)$
$C(n) = (1) + (2) + (3) + (4) + (5) + (6) + (7)$
$= 4\beta + 4\gamma + 4\beta\gamma + 9$
The previous algorithm has O(n) complexity
because it tries to isolate a function word at all
the syllable positions. However, if syllable
features are used then the worst-case time
complexity of the Korean morphological
analysis becomes a constant. In this case, we
should use the fact that there is no stem that
includes two successive syllables 'xy' such
that 'xy' is a substring of a grammatical
morpheme.
8. Conclusion
A syllable-based formalism is proposed to
solve the problem of morphological alternation
with morpheme isolation where many
candidates are generated by the phonological
rules. It improved the worst-case time complexity
from O(n) to a constant, and the number of
dictionary accesses is highly reduced using
the syllable features that are extracted from
words and formalized to be available for a
morphological analyzer. They are very useful
for the isolation of morphemes, which makes it
possible to guess the boundary position of a
stem without accessing the dictionary. They
are also useful to reject the implausible base
forms of a verb.
The characteristic set of syllables and the
syllable-based formalism may be applied to
languages whose words consist of
syllables and whose morphological operations are
described as syllable-to-syllable
transformations, to increase the performance of
the morphological analyzer. In addition,
idiosyncratic features of syllables may be
used for the analysis and recognition of
natural languages, such as spelling check,
phonological representation of words, and
character recognition.
The Korean morphological analyzer was
implemented on an IBM-PC 486 using the C
language. The system analyzed Korean text
at a speed of about 100 words/sec.
REFERENCES
[Abe86] M. Abe, Y. Ooshima, K. Yuura and
N. Takeichi, "A Kana-Kanji Translation
System for Non-Segmented Input Sentences
Based on Syntactic and Semantic
Analysis," Proceedings of the 11th
International Conference on Computational
Linguistics, pp.280-285, 1986.
[Bart86] E. Barton, "Computational
Complexity in Two-Level Morphology,"
24th Annual Meeting of the Association for
Computational Linguistics, 1986.
[Bear88] J. Bear, "Morphology and Two-level
Rules and Negative Rule Features,"
Proceedings of the 12th International
Conference on Computational Linguistics,
vol.3, pp.28-31, 1988.
[Cahi90] L.J. Cahill, "Syllable-based
Morphology," Proceedings of the 13th
International Conference on Computational
Linguistics, vol.3, pp.48-53, 1990.
[Chen92] K.J. Chen and S.H. Liu, "Word
Identification for Mandarin Chinese
Sentences," Proceedings of the 14th
International Conference on Computational
Linguistics, vol.1, pp.101-107, 1992.
[Chun90] H.S. Chung, "A Phonological
Knowledge Base System Using
Unification-based Formalism - A Case
Study of Korean Phonology -," Proceedings
of the 13th International Conference on
Computational Linguistics, pp.76-78, 1990.
[Kang93] S.S. Kang, Korean Morphological
Analysis using Syllable Information and
Multi-word unit Information, PhD
dissertation, Seoul National University, 1993.
[Kosk83] K. Koskenniemi, "Two-level Model
for Morphological Analysis," Proc. of the
8th International Joint Conference on
Artificial Intelligence, pp.683-685, 1983.
[Kosk88] K. Koskenniemi, "Complexity,
Two-Level Morphology and Finnish,"
Proceedings of the 12th International
Conference on Computational Linguistics,
pp.335-339, 1988.
[Pach92] T. Pachunke, O. Mertineit, K.
Wothke and R. Schmidt, "Broad Coverage
Automatic Morphological Segmentation of
German Words," Proceedings of the 14th
Conference on Computational Linguistics,
pp.1219-1222, 1992.
[Zhan90] B.T. Zhang and Y.T. Kim,
"Morphological Analysis and Synthesis by
Automated Discovery and Acquisition of
Linguistic Rules," Proceedings of the 13th
International Conference on Computational
Linguistics, pp.431-436, 1990.
