A TWO-LEVEL MORPHOLOGICAL ANALYSIS OF KOREAN
Deok-Bong Kim, Sung-Jin Lee, Key-Sun Choi, and Gil-Chang Kim
Center for Artificial Intelligence Research
Computer Science Department, KAIST 
373-1 Kusong-Dong, Yusong-Ku, Taejon 305-701, Korea 
E-mail: {dbkim, kschoi}@csking.kaist.ac.kr
ABSTRACT
The two-level morphology model has received a great deal of
attention and has been implemented for languages like Finnish,
English, Japanese, Russian, French, and so on. However, this model
has been claimed to be inappropriate for Korean morphological
analysis, because the complex conjugation (inflection) and
agglutination in word formation, and the syllable-based
representation of words, may lead to a huge number of two-level
morphological rules. In this paper, we show that the two-level model
can be successfully applied to Korean and that its rule size is
limited to only 52. An extension of two-level morphology is
described for the Korean language.
INTRODUCTION
The two-level morphology model (Koskenniemi, 1983; Antworth, 1990;
Barton, 1986; Ritchie, 1991; Sproat, 1992) is a well-known
computational model of morphology, which has adaptability as well as
simplicity. In practice, this model has been successfully applied to
several languages including Finnish, English, Japanese, Russian, and
French. However, the two-level model has been considered to be
inappropriate for Korean (Kang, 1992; Kwon, 1991). That is, the
two-level morphological analysis of Korean is believed to be
difficult and infeasible because the complex conjugation
(inflection) and agglutination in word formation, and the
syllable-based representation of words, may lead to a huge number of
two-level morphological rules. In this paper, we show that the
two-level model can be successfully applied to Korean and that its
rule size is limited to only 52.
This paper presents a successful two-level system for Korean
morphological analysis. The system is based on the shareware
PC-KIMMO (Antworth, 1990); however, we extended the I/O component of
PC-KIMMO to handle the Korean alphabet HANGUL; we constructed a
Korean dictionary and a Korean morphological grammar (i.e.,
morphotactics and spelling rules) for PC-KIMMO; we also used the
shareware KGEN (Miles, 1991) to translate the linguistic spelling
rules into executable automata (i.e., finite state transducers
(FSTs)). This paper focuses on the dictionary and the morphological
grammar for Korean.
TWO-LEVEL REPRESENTATION OF 
KOREAN WORDS 
The two-level model is concerned with directly mapping between two
representations of a word: (1) the surface form (SF) as it appears
in the text, and (2) the lexical form (LF), which is represented as a
sequence of basic morphs and diacritics (e.g., '+' to mark a
morpheme boundary and '#' for a word boundary). As a result, an input
word in the two-level model is analyzed by mapping the word itself
(SF) to a sequence of lexical forms in the dictionary without
intermediate stages. In this section, we present a two-level
representation of Korean words.
To understand the two-level description of Korean morphology, one
should be familiar with the Korean alphabet and its transcription
system, so we describe them first. In the ordinary writing system,
the Korean alphabet consists of 40 letters: 10 pure vowels, 11
compound vowels, 14 basic consonants and 5 double consonants. A
Korean word is represented as a sequence of syllables; a syllable
can be made up of a consonant, a vowel, and a consonant; there are
several forms of syllables (e.g., CV, CVC, VC, V, and C forms); and
an initial consonant letter may not be distinguished from a final
consonant letter. However, the initial consonant and the final
consonant must be distinguished from each other for a successful
two-level
Table 1: The transcription of the Korean alphabet (HANGUL).

Pure vowels
    HANGUL      ㅏ   ㅓ   ㅗ   ㅜ   ㅐ   ㅔ   ㅡ   ㅣ   ㅟ   ㅚ
    IPA         a    ə    o    u    ɛ    e    ɨ    i    ü    ö
    MYCODE      a    e    o    u    8    9    _    i    wu   wi

Compound vowels
    HANGUL      ㅑ   ㅕ   ㅛ   ㅠ   ㅒ   ㅖ   ㅝ   ㅞ   ㅘ   ㅙ   ㅢ
    IPA         ya   yə   yo   yu   yɛ   ye   wə   we   wa   wɛ   ɨy
    MYCODE      ya   ye   yo   yu   y8   y9   we   w9   wa   w8   yi

Basic consonants
    HANGUL      ㄱ   ㄴ   ㄷ   ㄹ   ㅁ   ㅂ   ㅅ   ㅇ   ㅈ   ㅊ   ㅋ   ㅌ   ㅍ   ㅎ
    IPA         k    n    t    l    m    p    s    ŋ    č    čh   kh   th   ph   h
    MYCODE(I)   g    n    d    l    m    b    s         j    c    k    t    p    h
    MYCODE(F)   G    N    D    L    M    B    S    *    J    C    K    T    P    H

Double consonants
    HANGUL      ㄲ   ㄸ   ㅃ   ㅆ   ㅉ
    IPA         k'   t'   p'   s'   č'
    MYCODE(I)   q    f    r    v    z
    MYCODE(F)   Q              V
system; if not, it might cause a lot of useless work (i.e., invalid
mappings) and incorrect results, because it is not clear whether the
i-th consonant in a word is an initial consonant or a final
consonant. Furthermore, to write two-level spelling rules for
PC-KIMMO, each letter of the Korean alphabet must be mapped to an
ASCII character on the keyboard. Therefore, we devised a
transcription system for the Korean alphabet as shown in Table 1,
which has the following features:
• There is no letter corresponding to the initial consonant 'ㅇ'. We
  did not consider the letter because it is a sort of orthographic
  filler in the ordinary writing system and is not pronounced.
• The initial consonant letters are not the same as the final
  consonant letters. (To see this, compare the initial consonants
  MYCODE(I) with the final consonants MYCODE(F) in Table 1.)
• Each of the compound vowels is represented by a pair of two
  letters: a semi-vowel letter (i.e., y or w) and one of the pure
  vowel letters excluding 'ㅟ' /ü/ and 'ㅚ' /ö/; here 'ㅟ' and 'ㅚ' are
  treated as compound vowels.
• There are two archiphonemic letters: (1) the archiphoneme A for
  the proper treatment of vowel harmony¹, which can be changed into
  the NULL symbol 0, a vowel letter a, or a vowel letter e by
  context; and (2) the archiphoneme I for the proper treatment of
  the predicative postposition '이' /i/, which can be changed into
  either 0 or a vowel letter i by context.

¹ Modern Korean has a "diagonal" vowel harmony (Ahn, 1985) kept in
only one area of word formation, that is, between the final vowel of
a verbal stem and the following ə-initial suffix. This system works
in the ə-initial suffix harmony, where ə has an alternation a if the
final vowel of a verbal stem is a light vowel a or o. For example,
the verb stem '보' /bo/ (to see) plus the suffix '어' /ə/
(INFINITIVE) becomes the verbal word '보아' /bo-a/. However, the verb
stem '주' /cu/ (to give) plus the suffix '어' /ə/ (INFINITIVE)
becomes the verbal word '주어' /cu-ə/. As a result, the archiphoneme
A is used for the initial vowel ə of suffixes, which is to
distinguish it from ə elsewhere.

We believe that our transcription system makes it simple and clear
to describe two-level spelling rules of Korean, and that it enables
the two-level processor to handle the complex spelling changes
efficiently.

Here, three special symbols are used to treat the lexical
irregularities of Korean verbal morphology properly: + for
regularity, X for 'le'-irregularity, and $ for all irregularities
excluding the 'le'-irregularity; X must be differentiated from $ for
the following reasons. In Korean morphology, most verbal stems
ending in the syllable '르' /lɨ/ are irregular. The final syllable
'르' /lɨ/ of the stem, when followed by the vowel '어' /ə/ and
preceded by any vowel other than the light vowels ('아' /a/ and '오'
/o/), is changed into '러' /lə/ and the consonant 'ㄹ' /l/ is added
to the preceding syllable. We call it 'l_'-irregularity. For
example, the verb stem '흐르' /hɨ-lɨ/ (to flow) plus the suffix '어'
/ə/ (INFINITIVE) becomes the verbal word '흘러' /hɨl-lə/. However,
there is also 'le'-irregularity, which occurs in the same context as
'l_'-irregularity: it only causes the following vowel '어' /ə/ to be
changed into '러' /lə/; for example, the verb stem '이르' /i-lɨ/ (to
arrive) plus the suffix '어' /ə/ (INFINITIVE) becomes the verbal
word '이르러' /i-lɨ-lə/. Therefore, a mechanism is needed to treat
them properly.
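The difference between the two irregularity classes can be sketched in a few lines of Python. This is only an illustration of the alternation described above, not the paper's rule set: the function name `infinitive`, the use of `_` as the code for the vowel /ɨ/, and the uppercase `L` for the added final consonant are assumptions based on the transcription conventions of Table 1, and cases the text does not describe are rejected rather than guessed.

```python
LIGHT_VOWELS = {"a", "o"}          # the light vowels named in the text
VOWEL_LETTERS = set("aeou89_i")    # assumed pure-vowel codes of the transcription


def last_vowel(letters):
    """Return the last vowel letter of a transcribed string, or None."""
    for ch in reversed(letters):
        if ch in VOWEL_LETTERS:
            return ch
    return None


def infinitive(stem, marker):
    """Attach the INFINITIVE suffix e to a stem ending in the syllable l_.

    marker is "$" for the l_-irregular class and "X" for the le-irregular
    class, following the special symbols introduced in the text.
    """
    if not stem.endswith("l_"):
        raise ValueError("this sketch only covers stems ending in l_")
    head = stem[:-2]
    if marker == "X":
        # le-irregularity: only the suffix vowel changes (e -> le).
        return stem + "le"
    if marker == "$" and last_vowel(head) not in LIGHT_VOWELS:
        # l_-irregularity: l_ + e -> le, and a final L joins the
        # preceding syllable.
        return head + "L" + "le"
    raise ValueError("case not described in the text")


print(infinitive("h_l_", "$"))  # to flow
print(infinitive("il_", "X"))   # to arrive
```

The same stem shape therefore yields two different surface forms, which is why the marker has to be stored with the lexical entry.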
One of the special symbols is used to represent a specific lexical
form, and is almost always placed at the end of the lexical form.
For example, the verbal stem guB has two meanings, i.e., "curved" as
an adjective and "grill" as a verb. In this case, the problem lies in
the difference between the variation forms for the adjective and
those for the verb; when the stem is combined with a suffix A, the
surface form becomes either guBe as an adjective, or guwe as a verb.
To distinguish between them, the following lexical forms can be
listed in the dictionary: guB+ for the regular adjective, and guB$
for the 'B'-irregular verb.
WORD STRUCTURE AND LEXICONS
The word structure in general denotes knowledge of the internal
morpheme combinations of known words. As a result, it shows how
morphemes can combine to form valid words; it is important for
proper word recognition. In the two-level model it is represented
with linked lexicons, i.e., with continuation classes of morphemes.

The continuation classes used in our lexicons are as follows:
interjection (IS), prenoun (PR), adverb (AB), noun (NN), pronoun
(PN), numeral (NU), verb (VB), adjective (AJ), verbalizer (VR),
postposition (PP), I-postposition (IP), nominal-prefix (NF),
verbal-prefix (VF), prefinal-ending (PE), final-ending (FE),
nominal-ending (NE)², Begin, and End. Every class indicates a
lexicon. However, Begin and End are special lexicons; Begin amounts
to the initial state in automata, and End has the same role as the
final state; in fact, they have no lexical entry. The following
shows our linked lexicons.
    Begin           -> interjection | prenoun | adverb
                     | noun | pronoun | numeral | verb
                     | adjective | nominal-prefix
                     | verbal-prefix
    interjection    -> End
    prenoun         -> End
    adverb          -> End | postposition
    nominal-prefix  -> noun
    verbal-prefix   -> verb | adjective
    noun            -> End | postposition
                     | I-postposition | verbalizer
    pronoun         -> End | postposition
                     | I-postposition
    numeral         -> End | postposition
                     | I-postposition
    verb            -> prefinal-ending | final-ending
                     | nominal-ending
    adjective       -> prefinal-ending
                     | final-ending | nominal-ending
    verbalizer      -> prefinal-ending
                     | final-ending | nominal-ending
    I-postposition  -> prefinal-ending
                     | final-ending | nominal-ending
    postposition    -> End
    prefinal-ending -> final-ending
                     | nominal-ending
    final-ending    -> End
    nominal-ending  -> End | postposition
                     | I-postposition

² The nominal-ending belongs to the final-ending, which consists of
nominal endings, sentential endings, and connective endings.

The right arrow '->' indicates that a class on its left side can
continue with one of the classes on its right side; a vertical bar
'|' indicates OR.
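The linked lexicons above amount to a small directed graph, and checking whether a sequence of morpheme classes forms a valid word is a walk from Begin to End. The following Python sketch encodes the listing as a dictionary; the names `CONTINUATIONS` and `is_valid` are illustrative, and the sketch assumes that the first alternative after "interjection" in the Begin entry is "prenoun" (the extracted listing shows "pronoun" twice).

```python
ENDINGS = {"prefinal-ending", "final-ending", "nominal-ending"}
NOMINAL_NEXT = {"End", "postposition", "I-postposition"}

# Continuation classes, transcribed from the linked lexicons in the text.
CONTINUATIONS = {
    "Begin": {"interjection", "prenoun", "adverb", "noun", "pronoun",
              "numeral", "verb", "adjective", "nominal-prefix",
              "verbal-prefix"},
    "interjection": {"End"},
    "prenoun": {"End"},
    "adverb": {"End", "postposition"},
    "nominal-prefix": {"noun"},
    "verbal-prefix": {"verb", "adjective"},
    "noun": NOMINAL_NEXT | {"verbalizer"},
    "pronoun": NOMINAL_NEXT,
    "numeral": NOMINAL_NEXT,
    "verb": ENDINGS,
    "adjective": ENDINGS,
    "verbalizer": ENDINGS,
    "I-postposition": ENDINGS,
    "postposition": {"End"},
    "prefinal-ending": {"final-ending", "nominal-ending"},
    "final-ending": {"End"},
    "nominal-ending": NOMINAL_NEXT,
}


def is_valid(classes):
    """Return True if the class sequence is a path from Begin to End."""
    path = ["Begin", *classes, "End"]
    return all(nxt in CONTINUATIONS.get(cur, set())
               for cur, nxt in zip(path, path[1:]))


print(is_valid(["noun", "postposition"]))                    # NN+PP
print(is_valid(["verb", "prefinal-ending", "final-ending"]))  # VB+PE+FE
print(is_valid(["postposition", "noun"]))                     # not a word
```

Because every class only names its permitted successors, long-distance constraints between non-adjacent morphemes cannot be stated in this representation.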
TWO-LEVEL RULES AND FINITE 
STATE AUTOMATA 
Based on the work on Korean morphology by Lee (1991), 52 two-level
rules have been developed for the Korean morphological alternations.
By way of an example, we explain the following Korean morphological
alternation in the two-level framework.

In Korean, some verbals ending in the final consonant B are
irregular. The final consonant B of the stem, when followed by a
vowel, is changed into w. But it is not changed when followed by a
consonant. For example, when an irregular verb doB (to help) is
combined with the suffix A, it is changed into dowa. In the
two-level system, it is represented as follows:

    Lexical Representation: d o B $ + A
    Surface Representation: d o w 0 0 a
This shows a correspondence between the lexical representation and
the surface representation. In PC-KIMMO, such a correspondence is
represented with the notation lexical-character:surface-character,
like d:d, o:o, B:w, $:0, +:0, and A:a. Here the lexical character $
is a signal indicating that the basic word or stem preceding it is
irregular, and it corresponds to a surface 0 (the NULL symbol),
which is not printed in the output form. The lexical + (a morpheme
boundary symbol) also corresponds to a surface 0.
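The correspondence can be made concrete with a short Python sketch that pairs the two representations character by character and then recovers each form from the pairs. The helper names are illustrative only; the convention that a surface 0 is not printed follows the description above.

```python
lexical = "d o B $ + A".split()
surface = "d o w 0 0 a".split()

# One lexical-character:surface-character pair per position.
pairs = list(zip(lexical, surface))


def pair_notation(pairs):
    """Render pairs in the lexical:surface notation."""
    return [f"{lex}:{surf}" for lex, surf in pairs]


def surface_form(pairs):
    """Concatenate surface characters, dropping the NULL symbol 0."""
    return "".join(surf for _, surf in pairs if surf != "0")


def lexical_form(pairs):
    """Concatenate lexical characters, dropping the NULL symbol 0."""
    return "".join(lex for lex, _ in pairs if lex != "0")


print(pair_notation(pairs))
print(lexical_form(pairs), "->", surface_form(pairs))
```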
The above alternation may be described as the following two-level
rule:

    B:w ⇔ ___ $:0 +:0 A:@        (B Variation Rule)

This rule states that a lexical B is realized as a surface w if and
only if it is followed by the conjugation information $, the
morpheme boundary +, and a linking suffix A. A surface @ in the
above rule stands for any alphabetic character that constitutes a
feasible pair with a lexical A. For example, the surface @ may be
realized as a, e, or 0 when the feasible pairs with lexical A are
A:a, A:e, and A:0.
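As a rough illustration of what this rule does in generation, the Python sketch below realizes a lexical string as a surface string. It is a hand-written simplification, not the compiled rule set: it covers only the B Variation Rule and a reduced form of the vowel harmony for A (a after a light vowel, e otherwise), and it never realizes A as 0. The function name `generate` is an assumption made for the example.

```python
LIGHT_VOWELS = {"a", "o"}
VOWEL_LETTERS = set("aeou89_i")  # assumed pure-vowel codes of the transcription


def generate(lexical):
    """Realize a lexical form such as 'doB$+A' as a surface string."""
    out = []
    for i, ch in enumerate(lexical):
        if ch == "B" and lexical[i + 1:i + 4] == "$+A":
            out.append("w")              # B:w only in the context $ + A
        elif ch in "$+":
            continue                     # $:0 and +:0 are not printed
        elif ch == "A":
            # Simplified vowel harmony: look at the last vowel emitted so far.
            prev = next((c for c in reversed(out) if c in VOWEL_LETTERS), None)
            out.append("a" if prev in LIGHT_VOWELS else "e")
        else:
            out.append(ch)               # default pairs such as d:d, o:o
    return "".join(out)


print(generate("doB$+A"))   # irregular verb 'to help'
print(generate("guB$+A"))   # irregular verb reading of guB
print(generate("guB++A"))   # regular adjective reading of guB
```

The last two calls show why the dictionary needs both guB$ and guB+: the same stem letters lead to different surface forms depending on the marker.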
The two-level rules can be automatically translated into state
transition tables by using a rule compiler such as TWOL (Karttunen,
1987) or KGEN (Miles, 1991). The tables built by KGEN can be used
directly in PC-KIMMO. The above rule is translated by KGEN into the
state transition table below:
          B  B  $  +  A  @    (lexical characters)
          w  @  0  0  @  @    (surface characters)
    1:    2  5  1  1  1  1
    2.    0  0  3  0  0  0
    3.    0  0  0  4  0  0
    4.    0  0  0  0  1  0
    5:    2  5  6  1  1  1
    6:    2  5  1  7  1  1
    7:    2  5  1  1  0  1
The rows of the table represent the seven states, in which final
states are marked with colons and nonfinal states are marked with
periods. The columns represent arcs from one state to another. A
zero transition indicates that there is no valid transition from
that state for that input symbol.
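To show how such a table is used, here is a minimal Python simulation of the seven-state automaton. It assumes the table as read from the text above (states 1, 5, 6 and 7 final), and it assumes that a pair is assigned to the most specific matching column, with @ acting as a wildcard; `column`, `parse`, and `accepts` are names chosen for this sketch and are not part of PC-KIMMO.

```python
# Columns: B:w, B:@, $:0, +:0, A:@, @:@ (0 means "no valid transition").
TABLE = {
    1: (2, 5, 1, 1, 1, 1),
    2: (0, 0, 3, 0, 0, 0),
    3: (0, 0, 0, 4, 0, 0),
    4: (0, 0, 0, 0, 1, 0),
    5: (2, 5, 6, 1, 1, 1),
    6: (2, 5, 1, 7, 1, 1),
    7: (2, 5, 1, 1, 0, 1),
}
FINAL_STATES = {1, 5, 6, 7}


def column(lex, surf):
    """Pick the most specific column that matches a lexical:surface pair."""
    if lex == "B":
        return 0 if surf == "w" else 1
    if (lex, surf) == ("$", "0"):
        return 2
    if (lex, surf) == ("+", "0"):
        return 3
    if lex == "A":
        return 4
    return 5


def parse(text):
    """Turn 'd:d o:o B:w' into [('d', 'd'), ('o', 'o'), ('B', 'w')]."""
    return [tuple(pair.split(":")) for pair in text.split()]


def accepts(pairs):
    """Run the automaton from state 1; accept only if it ends in a final state."""
    state = 1
    for lex, surf in pairs:
        state = TABLE[state][column(lex, surf)]
        if state == 0:
            return False
    return state in FINAL_STATES


print(accepts(parse("d:d o:o B:w $:0 +:0 A:a")))  # B:w in the licensed context
print(accepts(parse("d:d o:o B:B $:0 +:0 A:a")))  # B kept where w is required
```

States 2 to 4 enforce that B:w is followed by the full context, while states 5 to 7 block any other realization of B once that context has been seen.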
CONCLUSION 
We have shown that the two-level morphology model, which has been
claimed to be inappropriate for Korean, can be successfully applied
to Korean. That is, we have implemented a successful two-level
morphology system for Korean (see APPENDIX). This system is based on
PC-KIMMO, which is shareware. However, we modified the I/O component
of PC-KIMMO to handle the Korean alphabet HANGUL; we constructed a
Korean dictionary for PC-KIMMO, which contains about 12,000 entries;
we represented Korean morphotactics for PC-KIMMO, which indicates
the morphological structures of known words; and we wrote 52
two-level spelling rules for PC-KIMMO, which recover almost all
spelling alternations in Korean morphology.
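The I/O extension mentioned above has to convert between HANGUL syllables and the ASCII transcription. The paper does not show how this is done; the Python sketch below is one possible approach for modern Unicode text, which decomposes each precomposed syllable arithmetically and looks up the codes. The code tables follow the transcription system as far as it can be read from the paper; the code for the final consonant 'ㅇ' (`*`), the two-letter spellings for final consonant clusters, and the assignment of `wu` and `wi` are assumptions.

```python
# Codes in Unicode jamo order. The initial 'ㅇ' has no letter (empty string).
INITIALS = ["g", "q", "n", "d", "f", "l", "m", "b", "r", "s",
            "v", "", "j", "z", "c", "k", "t", "p", "h"]
MEDIALS = ["a", "8", "ya", "y8", "e", "9", "ye", "y9", "o", "wa", "w8",
           "wi", "yo", "u", "we", "w9", "wu", "yu", "_", "yi", "i"]
# Final clusters are spelled as two final letters (an assumption).
FINALS = ["", "G", "Q", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM",
          "LB", "LS", "LT", "LP", "LH", "M", "B", "BS", "S", "V", "*",
          "J", "C", "K", "T", "P", "H"]

SYLLABLE_BASE = 0xAC00   # first precomposed Hangul syllable
SYLLABLE_COUNT = 11172   # 19 initials * 21 medials * 28 finals


def to_mycode(text):
    """Transcribe Hangul syllables to ASCII; other characters pass through."""
    out = []
    for ch in text:
        index = ord(ch) - SYLLABLE_BASE
        if 0 <= index < SYLLABLE_COUNT:
            initial, rest = divmod(index, 21 * 28)
            medial, final = divmod(rest, 28)
            out.append(INITIALS[initial] + MEDIALS[medial] + FINALS[final])
        else:
            out.append(ch)
    return "".join(out)


print(to_mycode("도와"))
print(to_mycode("학교"))
```

Keeping separate tables for initial and final consonants is what lets the output distinguish, for example, the final G from the initial g in the second example.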
Our two-level system has been tested with 2,172 words randomly
selected from Korean textbooks (413,975 words) for elementary
education. For this test set, the system produces the correct
outputs, although it also produces about 5% extra incorrect analyses
(i.e., overgeneration). Here the overgeneration is ascribed to the
weak expressive power of morphotactic information in PC-KIMMO.
REFERENCES 
Ahn, S. C. (1985). The Interplay of Phonology and 
Morphology in Korean. Ph.D. Thesis, Univ. of Illi- 
nois. 
Antworth, E. L. (1990). PC-KIMMO: A Two-Level 
Processor for Morphological Analysis. Summer Insti- 
tute of Linguistics. 
Barton, G. E. (1986). Computational Complexity in Two-Level
Morphology. In Proceedings of the 24th Annual Meeting of the
Association for Computational Linguistics, pp. 53-59.
Kang, S. S. and Y. T. Kim (1992). A Computational Analysis Model of
Irregular Verbs in Korean Morphological Analyzer. Journal of Korea
Information Science Society, 19:2, pp. 151-164. (in Korean)
Karttunen, L., K. Koskenniemi, and R. M. Kaplan (1987). A Compiler
for Two-Level Phonological Rules. Xerox Palo Alto Research Center
and Center for the Study of Language and Information.
Koskenniemi, K. (1983). Two-Level Morphology: A General
Computational Model for Word-Form Recognition and Production. Ph.D.
Thesis, Univ. of Helsinki.
Kwon, H. C. and Y. S. Chae (1991). A Dictionary-based Morphological
Analysis. In Proceedings of Natural Language Processing Pacific Rim
Symposium, pp. 178-185.
Lee, H. S. and B. H. Ahn (1991). Lecture on HANGUL Orthography.
Shin-Koo Press, Seoul. (in Korean)
Miles, N. and E. Antworth (1991). Preliminary Documentation for
KGEN - a rule compiler for PC-KIMMO -. Summer Institute of
Linguistics.
Ritchie, G. D., G. J. Russell, A. W. Black, and S. G. Pulman (1991).
Computational Morphology: Practical Mechanisms for the English
Lexicon. MIT Press, Cambridge.
Sproat, R. (1992). Morphology and Computation. MIT Press, Cambridge.
APPENDIX: Running Examples 
    dbkim/csking> pckimmo
    PC-KIMMO TWO-LEVEL PROCESSOR
    Version 1.0.5, Copyright 1992 SIL
    Type ? for help
    PC-KIMMO> load rule kor.rul
    Rules being loaded from kor.rul
    52 Rules Loaded
    PC-KIMMO> load lexicon kor.lex
    Lexicons being loaded from kor.lex
    Lexicon Start              1 entries
    Lexicon Nominal         7973 entries
    Lexicon Adverb            20 entries
    Lexicon Verbal          2784 entries
    Lexicon Ending            94 entries
    Lexicon Postposition    1443 entries
    Lexicon Others            32 entries
    Lexicon End                1 entries
    PC-KIMMO> recognize
    recognizer>>
    doB$+A          [VB+FE]
    recognizer>>
    il_$+A          [VB+FE]
    recognizer>>
    ha$+AV++da      [VB+PE+FE]
    recognizer>>
    ha$+AV++da      [VB+PE+FE]
    recognizer>>
    haGgyo+9se      [NN+PP]
    recognizer>>
    juN             [NN]
    ju++N           [VB+FE]
    juL++N
    recognizer>>
    pi+ha$+da       [NN+VR+FE]
