Probabilistic Models for Korean Morphological Analysis
Do-Gil Lee and Hae-Chang Rim
Dept. of Computer Science & Engineering
Korea University
1, 5-ka, Anam-dong, Seongbuk-gu
Seoul 136-701, Korea
CUdglee, rimCV@nlp.korea.ac.kr
Abstract
This paper discusses Korean morpho-
logical analysis and presents three
probabilistic models for morphological
analysis. Each model exploits a distinct
linguistic unit as a processing unit. The
three models can compensate for each
other’s weaknesses. Contrary to the
previous systems that depend on man-
ually constructed linguistic knowledge,
the proposed system can fully automat-
ically acquire the linguistic knowledge
from annotated corpora (e.g. part-of-
speech tagged corpora). Besides, with-
out any modification of the system, it
can be applied to other corpora having
different tagsets and annotation guide-
lines. We describe the models and
present evaluation results on three cor-
pora with a wide range of conditions.
1 Introduction
This paper discusses Korean morphological anal-
ysis. Morphological analysis is to break down
an Eojeol
1
into morphemes, which is the smallest
meaningful unit. The jobs to do in morphological
analysis are as follows:
AF Separating an Eojeol into morphemes
AF Assigning the morpho-syntactic category to
each morpheme
1
Eojeol is the surface level form of Korean and is the
spacing unit delimited by a whitespace.
AF Restoring the morphological changes to the
original form
We have to consider some difficult points in Ko-
rean morphology: there are two kinds of ambigu-
ities (segmentation ambiguity and part-of-speech
ambiguity). Moreover, morphological changes to
be restored are very frequent. In contrast to part-
of-speech (POS) tagging, morphological analysis
is characterized by producing all the (grammati-
cally) regal interpretations. Table 1 gives exam-
ples of morphological analysis for Eojeols “na-
neun” and “gam-gi-neun”.
Previous works on morphological analysis de-
pends on manually constructed linguistic knowl-
edge such as morpheme dictionary, morphosyn-
tactic rules, and morphological rules. There are
two major disadvantages in this approach:
AF Construction of the knowledge base is time-
consuming and labor-intensive. In addition,
storing every word in a lexicon is impossi-
ble so the previous approch suffers from the
unknown word problem.
AF There is a lack of portability. Because the re-
sults produced by a morphological analyzer
are limited to the given tagset and the anno-
tation guidelines, it is very difficult to apply
the system to other tagsets and guidelines.
The proposed morphological analyzer,
ProKOMA, tries to overcome these limitations:
Firstly, it uses only POS tagged corpora as an
information source and can automatically acquire
a knowledge base from these corpora. Hence,
there is no necessity for the manual labor in con-
structing and maintaining such a knowledge base.
197
Table 1: Examples of morphological analysis
na-neun gam-gi-neun
na/np+neun/jx ‘I am’ gam-gi/pv+neun/etm ‘be wound’
na/pv+neun/etm ‘to sprout’ gam-gi/nc+neun/jx ‘a cold is’
nal/pv+neun/etm ‘to fly’ gam/pv+gi/etn+neun/jx ‘to wash is’
Although constructing such corpora also requires
a lot of efforts, the amount of annotated corpora
is increasing every year. Secondly, regardless
of tagsets and annotation guidelines, it can be
applied to any training data without modification.
Finally, it can provide not only analyzed results
but also their probabilities by the probabilistic
models. In Korean, no attempt has been made at
probabilistic approach to morphological analysis.
Probabilities enable the system to rank the results
and to provide the probabilities to the next
module such as POS tagger.
2 Related works
Over the past few decades, a considerable number
of studies have been made on Korean morpho-
logical analysis. The early studies concentrated
on the algorithmic research. The following ap-
proaches belong to this group: longest matching
algorithm, tabular parsing method using CYK al-
gorithm (Kim, 1986), dictionary based approach
(Kwon, 1991), two-level morphology(Lee, 1992),
and syllable-based approach (Kang and Kim,
1994).
Next, many studies have been made on im-
proving the efficiency of the morphological an-
alyzers. There have been studies to reduce the
search space and implausible interpretations by
using characteristics of Korean syllables (Kang,
1995; Lim et al., 1995).
There have been no standard tagset and anno-
tation guideline, so researchers have developed
methods with their own tagsets and guidelines.
The Morphological Analysis and Tagger Evalua-
tion Contest (MATEC) took place in 1999. This is
the first trial about the objective and relative eval-
uation of morphological analysis. Among the par-
ticipants, some newly implemented the systems
and others converted the existing systems’ results
through postprocessing steps. In both cases, they
reported that they spent much effort and argued
the necessity of tuning the linguistic knowledge.
All the systems described so far can be con-
sidered as the so called dictionary and rule based
approach. In this approach, the quality of the dic-
tionary and the rules govern the system’s perfor-
mance.
The proposed approach is the first attempt to
probabilistic morphological analysis. The aim of
the paper is to show that this approach can achieve
comparable performances with the previous ap-
proaches.
3 Probabilistic morphological analysis
model
Probabilistic morphological analysis generates all
the possible interpretations and their probabilities
for a given Eojeol DB. The probability that a given
Eojeol DB is analyzed to a certain interpretation CA
is represented as C8B4CACYDBB5. The interpretation CA
is made up of a morpheme sequence C5 and its
corresponding POS sequence CC as given in Equa-
tion 1.
C8B4CACYDBB5BPC8B4C5BNCCCYDBB5 (1)
In the following subsections, we describe the
three morphological analysis models based on
three different linguistic units (Eojeol, mor-
pheme, and syllable
2
).
3.1 Eojeol-unit model
For the Eojeol-unit model, it is sufficient to store
the frequencies of each Eojeol (surface level
form) and its interpretation acquired from the
POS tagged corpus
3
.
The probabilities of Equation 1 are estimated
by the maximum likelihood estimator (MLE) us-
ing relative frequencies in the training data.
2
In Korean written text, each character has one syllable.
We do not distinguish between character and syllable in this
paper.
3
ProKOMA extracts only Eojeols occurred five times or
more in training data.
198
The most prominent advantage of the Eojeol-
unit analysis is its simplicity. As mentioned be-
fore, morphological analysis of Korean is very
complex. The Eojeol-unit analysis can avoid such
complex process so that it is very efficient and
fast. Besides, it can reduce unnecessary results
by only producing the interpretations that really
appeared in the corpus. So, we also expect an im-
provement in accuracy.
Due to the high productivity of Korean Eojeol,
the number of possible Eojeols is very large so
storing all kinds of Eojeols is impossible. There-
fore, using the Eojeol-unit analysis alone is unde-
sirable, but a small number of Eojeols with high
frequency can cover a significant portion of the
entire ones, thus this model will be helpful.
3.2 Morpheme-unit model
As discussed, not all Eojeols can be covered by
the Eojeol-unit analysis. The ultimate goal of
morphological analysis is to recognize every mor-
pheme within an Eojeol. For these reasons, most
previous systems have used morpheme as a pro-
cessing unit for morphological analysis.
The morpheme-unit morphological analysis
model is derived as follows by introducing lexi-
cal form D0:
C8B4CACYDBB5 BP C8B4D0CYDBB5C8B4CACYD0BNDBB5 (2)
where D0 should satisfy the following condition:
D0 BE C4
DB
CKC8B4D0CYCAB5BPBD
where C4
DB
is a set of lexical forms that can be de-
rived from the surface form DB. This condition
means that among all possible lexical forms for
agivenDB (C4
DB
), the only lexical form D0 is deter-
ministically derived from the interpretation CA.
C8B4D0CYDBB5C8B4CACYD0BNDBB5 AP C8B4D0CYDBB5C8B4CACYD0B5 (3)
AP C8B4D0CYDBB5C8B4CAB5 (4)
BP C8B4D0CYDBB5C8B4C5BNCCB5 (5)
Equation 3 assumes the interpretation CA and the
surface form DB are conditionally independent
given the lexical form D0. Since the lexical form
D0 is underlying in the morpheme sequence C5
4
,
4
A lexical form is just the concatenation of morphemes.
the lexical form D0 can be omitted as in equation 4.
In Equation 5, the left term C8B4D0CYDBB5 denotes “the
morphological restoration model”, and the right
C8B4C5BNCCB5 “the morpheme segmentation and POS
assignment model”.
We describe the morphological restoration
model first. The model is the probability of the
lexical form given a surface form and is to encode
the probability that the CZ substrings between the
surface form and its lexical form correspond to
each other. The equation of the model is as fol-
lows:
C8B4D0CYDBB5 AP
CZ
CH
CYBPBD
C8B4CMD7
CY
CYD7
CY
B5 (6)
where, D7
CY
and CMD7
CY
denote the CYth substrings of the
surface form and the lexical form, respectively.
We call such pairs of substrings “morpholog-
ical information”. This information can be ac-
quired by the following steps: If a surface form
(Eojeol) and its lexical form are the same, each
syllable pair of them is mapped one-to-one and
extracted. Otherwise, it means that a morpholog-
ical change occurs. In this case, the pair of two
substrings from the beginning to the end of the
mismatch is extracted. The morphological infor-
mation is also automatically extracted from train-
ing data. Table 2 shows some examples of apply-
ing the morphological restoration model.
Now we turn to the morpheme segmentation
and POS assignment model. It is the joint prob-
ability of the morpheme sequence and the tag se-
quence.
C8B4C5BNCCB5 BP C8B4D1
BDBND2
BND8
BDBND2
B5
AP
D2
CH
CXBPBD
C8B4D1
CX
CYD8
CX
B5C8B4D8
CX
CYD8
CXA0BD
B5
A2C8B4D8
BXC7CF
CYD8
D2
B5 (7)
In equation 7, D8
BC
and D8
BXC7CF
are pseudo tags to
indicate the beginning and the end of Eojeol, re-
spectively. We introduce the D8
BXC7CF
symbol to
reflect the preference for well-formed structure
of a given Eojeol. The model is represented
as the well-known bigram hidden Markov model
(HMM), which is widely used in POS tagging.
The morpheme dictionary and the morphosyn-
tactic rules that have been used in the previous
199
Table 2: Examples of applying the morphological restoration model
Surface form Lexical form Probability Description
na-neun na-neun C8B4naCYnaB5C8B4neunCYneunB5 No phonological change
na-neun nal-neun C8B4nalCYnaB5C8B4neunCYneunB5 ‘l’ irregular conjugation
go-ma-wo go-mab-eo C8B4goCYgoB5C8B4mab-eoCYma-woB5 ‘b’ irregular conjugation
beo-lyeo beo-li-eo C8B4beoCYbeoB5C8B4li-eoCYlyeoB5 Contraction
ga-seo ga-a-seo C8B4gaCYgaB5C8B4a-seoCYseoB5 Ellipsis
approaches are included in the lexical probabil-
ity C8B4D1
CX
CYD8
CX
B5 and the transition probability C8B4D8
CX
CY
D8
CXA0BD
B5.
3.3 Syllable-unit model
One of the most difficult problems in morphologi-
cal analysis is the unknown word problem, which
is caused by the fact that we cannot register every
possible morpheme in the dictionary. In English,
contextual information and suffix information is
helpful to estimate the POS tag of an unknown
word. In Korean, the syllable characteristics can
be utilized. For instance, a syllable “eoss” can
only be a pre-final ending.
The syllable-unit model is derived from Equa-
tion 4 as follows:
C8B4D0CYDBB5C8B4CAB5BPC8B4D0CYDBB5C8B4BVBNCDB5 (8)
where BV BP CR
BDBND1
is the syllable sequence of the
lexical form, and CD BP D9
BDBND1
is its corresponding
syllable tag sequence.
In the above equation, C8B4D0 CY DBB5 is the same
as that of the morpheme-unit model (Equation 6),
we use the morpheme-unit model’s result as it is.
The right term C8B4BVBNCDB5 is referred to as “the POS
assignment model”.
The POS assignment model is to assign the D1
syllables to the D1 syllable tags:
C8B4BVBNCDB5 BP C8B4CR
BDBND1
BND9
BDBND1
B5 (9)
AP
D1
CH
CXBPBD
AW
C8B4CR
CX
CYCR
CXA0BEBNCXA0BD
BND9
CXA0BEBNCXA0BD
B5
C8B4D9
CX
CYCR
CXA0BDBNCX
BND9
CXA0BEBNCXA0BD
B5
AX
A2C8B4CR
BXC7CF
CYCR
D1A0BDBND1
BND9
D1A0BDBND1
B5
A2C8B4D9
BXC7CF
CYCR
D1
BNCR
BXC7CF
BND9
D1A0BDBND1
B5(10)
In Equation 10, when CX is less than or equal to
zero, CR
CX
s and D9
CX
s denote the pseudo syllables and
the pseudo tags, respectively. They indicate the
beginning of Eojeol. Analogously, CR
BXC7CF
and
D9
BXC7CF
denote the pseudo syllables and the pseudo
tags to indicate the end of Eojeol, respectively.
Two Markov assumptions are applied in Equa-
tion 10. One is that the probability of the current
syllable CR
CX
conditionally depends only on the pre-
vious two syllables and two syllable tags. The
other is that the probability of the current syllable
tag D9
CX
conditionally depends only on the previous
syllable, the current syllable, and the previous two
syllable tags. This model can consider broader
context by introducing the less strict independent
assumption than the HMM.
In order to convert the syllable sequence BV and
the syllable tag sequence CD to the morpheme se-
quence C5 and the morpheme tag sequence CC,we
can use two additional symbols (“B” and “I”) to
indicate the boundary of morphemes: a “B” de-
notes the first syllable of a morpheme and an “I”
any non-initial syllable. Examples of syllable-
unit tagging with BI symbols are given in Table
3.
4 Experiments
4.1 Experimental environment
For evaluation, three data sets having different tag
sets and annotation guidelines are used: ETRI
POS tagged corpus, KAIST POS tagged corpus,
and Sejong POS tagged corpus. All experiments
were performed by the 10-fold cross-validation.
Table 4 shows the summary of the corpora.
Table 4: Summary of the data
Corpus ETRI KAIST Sejong
# of Eojeols 288,291 175,468 2,015,860
# of tags 27 54 41
In this paper, we use the following measures in
order to evaluate the system:
200
Table 3: Examples of syllable tagging with BI symbols
Eojeol na-neun ‘I’ hag-gyo-e ‘to school’ gan-da ‘go’
Tagged Eojeol na/np+neun/jx hag-gyo/nc+e/jc ga/pv+n-da/ef
Morpheme na neun hag-gyo e ga n-da
Morpheme tag np jx nc jc pv ef
Syllable na neun hag gyo e ga n da
Syllable tag B-np B-jx B-nc I-nc B-jc B-pv B-ef I-ef
Answer inclusion rate (AIR) is defined as the
number of Eojeols among whose results con-
tain the gold standard over the entire Eojeols
in the test data.
Average ambiguity (AA) is defined as the aver-
age number of returned results per Eojeol by
the system.
Failure rate (FR) is defined as the number of
Eojeols whose outputs are not produced over
the number of Eojeols in the test data.
1-best tagging accuracy (1A) is defined as the
number of Eojeols of which only one inter-
pretation with highest probability per Eojeol
is matched to the gold standard over the en-
tire Eojeols in the test data.
There is a trade-off between AIR and AA. If
a system outputs many results, it is likely to in-
clude the correct answer in them, but this leads
to an increase of the ambiguity, and vice versa.
The higher AIR is, the better the system. The
AIR can be an upper bound on the accuracy of
POS taggers. On the contrary to AIR, the lower
AA is, the better the system. A low AA can re-
duce the burden of the disambiguation process of
the POS tagger. Although the 1A is not used as
a common evaluation measure for morphological
analysis because previous systems do not rank the
results, ProKOMA can be evaluated by this mea-
sure because it provides the probabilities for the
results. This measure can also be served as a base-
line for POS tagging.
4.2 Experimental results
To investigate the performance and the effective-
ness of the three models, we conducted several
tests according to the combinations of the mod-
els. For each test, we also performed the exper-
iments on the three corpora. The results of the
experiments are listed in Table 5. In the table,
“E”, “M”, and “S” mean the Eojeol-unit analysis,
the morpheme-unit analysis, and the syllable-unit
analysis, respectively. The columns having more
than one symbol mean that each model performs
sequentially.
According to the results, when applying a sin-
gle model, each model shows the significant dif-
ferences, especially between “E” and “S”. Be-
cause of low coverage of the Eojeol-unit analy-
sis, “E” shows the lowest AIR and the highest FR.
However, it shows the lowest AA because it pro-
duces the small number of results. On the con-
trary, “S” shows the highest AA but the best per-
formances on AIR and FR, which is caused by
producing many results.
Most previous systems use morpheme as a
processing unit for morphological analysis. We
would like to examine the effectiveness of the
proposed models based on Eojeol and syllable.
First, compare the models that use the Eojeol-
unit analysis with others (“M” vs. “EM”, “S” vs.
“ES”, and “MS” vs. “EMS”). When applying the
Eojeol-unit analysis, AA is decreased, and AIS
and 1A are increased. Then, compare the mod-
els that use the syllable-unit analysis with others
(“E” vs. “ES”, “M” vs. “MS”, and “EM” vs.
“EMS”). When applying the syllable-unit anal-
ysis, AIR and 1A are increased, and FR is de-
creased. Therefore, both models are very useful
when compared the morpheme-unit model only.
Compared with the performances of two sys-
tems that participated in MATEC 99, we listed
the results in Table 6. In this evaluation, the
ETRI corpus was used and the number of Eo-
jeols included in the test data is 33,855. The
evaluation data used in MATEC 99 and ours
are not the same, but are close. As can be
201
Table 5: Experimental results according to the combination of the processing units
Data Measure E M S EM ES MS EMS
ETRI
Answer inclusion rate (%) 54.65 93.87 98.91 94.16 98.81 97.09 97.22
Average ambiguity 1.23 2.63 6.95 2.10 4.46 2.95 2.41
Failure rate (%) 45.21 3.81 0.06 3.67 0.06 0.06 0.06
1-best accuracy (%) 51.66 83.49 89.98 86.41 91.22 86.15 88.92
KAIST
Answer inclusion rate (%) 57.43 94.29 98.36 94.41 98.25 96.97 97.02
Average ambiguity 1.26 1.84 6.05 1.57 3.80 2.16 1.89
Failure rate (%) 42.40 3.73 0.06 3.67 0.06 0.06 0.06
1-best accuracy (%) 54.22 87.51 90.02 89.18 91.02 89.50 91.12
Sejong
Answer inclusion rate (%) 67.79 90.96 99.38 92.17 99.33 96.60 97.09
Average ambiguity 1.29 2.35 6.60 1.82 3.52 2.72 2.17
Failure rate (%) 32.13 5.94 0.02 5.21 0.02 0.02 0.02
1-best accuracy (%) 64.64 83.86 91.56 87.00 92.96 88.72 91.16
Table 6: Performances of two systems participated in MATEC 99
(Lee et al., 1999)’s system (Song et al., 1999)’s system
Answer inclusion rate (%) 98 92
Average ambiguity 4.13 1.75
seen, the Lee et al. (1999)’s system is better than
ProKOMA in terms of AIS, but it generates too
many results (with higher AA).
5 Conclusion
We have presented and described the new prob-
abilistic models used in our Korean morpholog-
ical analyzer ProKOMA. The previous systems
depend on manually constructed linguistic knowl-
edge such as morpheme dictionary, morphosyn-
tactic rules, and morphological rules. The system,
however, requires no manual labor because all the
information can be automatically acquired by the
POS tagged corpora. We also showed that the sys-
tem is portable and flexible by the experiments on
three different corpora.
The previous systems take morpheme as a pro-
cessing unit, but we take three kinds of processing
units (e.g. Eojeol, morpheme, and syllable). Ac-
cording to the experiments, we can know that the
Eojeol-unit analysis contributes efficiency and ac-
curacy, and the syllable-unit analysis is robust in
the unknown word problem and also contributes
accuracy. Finally, the system achieved compara-
ble performances with the previous systems.
References
S.-S. Kang and Y.-T. Kim. 1994. Syllable-based
model for the Korean morphology. In Proceedings
of the 15th International Conference on Computa-
tional Linguistics, pages 221–226.
S.-S. Kang. 1995. Morphological analysis of Ko-
rean irregular verbs using syllable characteristics.
Journal of the Korea Information Science Society,
22(10):1480–1487.
S.-Y. Kim. 1986. A morphological analyzer for
Korean language with tabular parsing method and
connectivity information. Master’s thesis, Dept.
of Computer Science, Korea Advanced Institute of
Science and Technology.
H.-C. Kwon. 1991. Dictionary-based morphological
analysis. In Proceedings of the Natural Language
Processing Pacific Rim Symposium, pages 87–91.
S.-Z. Lee, B.-R. Park, J.-D. Kim, W.-H. Ryu, D.-G.
Lee, and H.-C. Rim. 1999. A predictive morpho-
logical analyzer, a part-of-speech tagger based on
joint independence model, and a fast noun extractor.
In Proceedings of the MATEC 99, pages 145–150.
S.-J. Lee. 1992. A two-level morphological analy-
sis of Korean. Master’s thesis, Dept. of Computer
Science, Korea Advanced Institute of Science and
Technology.
H.-S. Lim, S.-Z. Lee, and H.-C. Rim. 1995. An ef-
ficient Korean mophological analysis using exclu-
sive information. In Proceedings of the 1995 In-
ternational Conference on Computer Processing of
Oriental Languages, pages 255–258.
T.-J. Song, G.-Y. Lee, and Y.-S. Lee. 1999. Mor-
phological analyzer using longest match method for
syntactic analysis. In Proceedings of the MATEC
99, pages 157–166.
202
