Generalized unknown morpheme guessing for hybrid POS tagging 
of Korean* 
Jeongwon Cha and Geunbae Lee and Jong-Hyeok Lee 
Department of Computer Science & Engineering 
Pohang University of Science & Technology 
Pohang, Korea 
{himen, gblee, jhlee}@postech.ac.kr 
Abstract 
Most of errors in Korean morphological anal- 
ysis and POS (Part-of-Speech) tagging are 
caused by unknown morphemes. This paper 
presents a generalized unknown morpheme han- 
dling method with P OSTAG (POStech TAGger) 
which is a statistical/rule based hybrid POS 
tagging system. The generalized unknown mor- 
pheme guessing is based on a combination of 
a morpheme pattern dictionary which encodes 
general lexical patterns of Korean morphemes 
with a posteriori syllable tri-gram estimation. 
The syllable tri-grams help to calculate lexical 
probabilities of the unknown morphemes and 
are utilized to search the best tagging result. 
In our scheme, we can guess the POS's of un- 
known morphemes regardless of their numbers 
and positions in an eojeol, which was not possi- 
ble before in Korean tagging systems. In a se- 
ries of experiments using three different domain 
corpora, we can achieve 97% tagging accuracy 
regardless of many unknown morphemes in test 
corpora. 
1 Introduction 
Part-of-speech (POS) tagging has many difficult 
problems to attack such as insufficient training 
data, inherent POS ambiguities: and most se- 
riously unknown words. Unknown words are 
ubiquitous in any application and cause major 
tagging failures in many cases. Since Korean 
is an agglutinative language, we have unknown 
morpheme problems instead of unknown words 
in our POS tagging. 
The usual way of unknown-morpheme han- 
dling before was to guess possible POS's for an 
unknown-morpheme by checking connectable 
" This project was supported by KOSEF (teukjeongki- 
cho #970-1020-301-3, 1997). 
functional morphemes in the same eojeol l 
(Kang, 1993). In this way, they could guess 
possible POS's for a single unknown-morpheme 
only when it is positioned in the begining of 
an eojeol. If an eojeol contains more than one 
unknown-morphemes or if unknown-morphemes 
appear other than the first position, all the 
previous methods cannot efficiently estimate 
them. sO, we propose a morpheme-pattern 
dictionary which enables us to treat unknown- 
morphemes in the same way as registered known 
morphemes, and thereby to guess them regard- 
less of their numbers and positions in an eo- 
jeol. The unknown-morpheme handling using 
the morpheme-pattern dictionary is integrated 
into a hybrid POS disambiguation. 
The POS disambiguation has usually been per- 
formed by statistical approaches mainly using 
hidden markov model (HMM) (Cutting et al., 
1992; Kupiec. 1992; Weischedel et al., 1993). 
However. since statistical approaches take into 
account neighboring tags only within a limited 
window (usually two or three), sometimes the 
decision cannot cover all linguistic contexts nec- 
essary for POS disambiguation. Also the ap- 
proaches are inappropriate for idiomatic expres- 
sions for which lexical terms need to be directly 
referenced. The statistical approaches are not 
enough especially for agglutinative languages 
(such as Korean) which have usually complex 
morphological structures. In agglutinative lan- 
guages, a word (called eojeol in Korean) usu- 
ally consists of separable single stem-morpheme 
plus one or more functional morphemes, and the 
POS tag should be assigned to each morpheme 
to cope with the complex morphological phe- 
nomena. Recently, rule-based approaches are 
tAn eojeol is a Korean spacing unit(similar to En- 
glish word) which usually consists of one or more stem 
morphemes and functional morphemes. 
85 
re-studied to overcome the limitations of staffs- 
tical approaches by learning symbolic tagging 
rules automatically from a corpus (Brill, 1992; 
Bril!. 1994). Some systems even perform the 
POS tagging as part of a syntactic analysis pro- 
cess (Voutilainen, 1995). However, rule-based 
approaches alone, in general, are not very ro- 
bust, and not portable enough to be adjusted 
to new tag sets and new languages. Also the. 
performance is usually no better than the sta- 
tistical counterparts (Brill, 1992). To gain the 
portability and robustness and also to overcome 
the limited coverage of statistical approaches, 
we adopt a hybrid method that can combine 
both statistical and rule-based approaches for 
POS disambiguation. 
2 Linguistic characteristics of 
Korean 
Korean is classified as an agglutinative lan- 
guage in which ~.n eojeol consists of several 
number of morphemes that have clear-cut mor- 
pheme boundaries. For examples. 'Q~ ~'~Tl~l 
~t r-l-(I caught a cold)" consists of 3 eojeols and 
7 morpheme, such as2 Q(I)/T + ~(au:~iliary 
particle)/jS, %1-71(cold)/MC + oil(other parti- 
cle)/jO, ~ e.l(catch)/DR + N(past tense)/eGS 
+ c\]-(final ending)/eGE. Below are the charac- 
teristics of Korean that must be considered for 
morphological-level natural language processing 
and POS tagging. 
As an agglutinative language, Korean 
POS tagging is usually performed on 
a morpheme basis rather than an eo- 
jeol basis. So, morphological analysis 
is essential to POS tagging because 
morpheme segmentation is much more 
important and difficult than POS assign- 
ment. Moreover: morphological analysis 
should segment out unknown morphemes 
as well as known morphemes, so un- 
known morpheme handling should be 
integrated into the morphological analysis 
process. There are three possible analyses 
fl'om the eojeol "Q~" : 'Q(I)/T' + 
' ~(subject-marker)/jS', ~ ut-(sprout)/DR' 
+ '~(adnominal)/eCNMG', ?~(fly)/DI' 
+ '~(adnominal)/eCNMG', so morpheme 
"~Here, '+' is a morpheme boundary in an eojeol and 
'/' is for the POS tag symbols (see Fig. 1). 
segmentation is often ambiguous. 
• Korean is a postpositional language with 
many kind of noun-endings (particles), 
verb-endings (other endings), and prefinal 
verb-endings (prefinal endings). It is these 
functional morphemes, rather than eojeol's 
order, which determine most of the gram- 
matical relations such as noun's syntactic 
flmctions, verb's tense, aspect, modals, and 
even modi~ing relations between eojeots. 
For example. ~/jS' is an atuxiliary parti- 
cle, so eojeol ~'G-~-" has a subject role due 
to the particle :~/jS'. 
• Complex spelling changes frequently occur 
between morphemes when two morphemes 
combine to form an eojeol. These spelling 
changes make it difficult to segment the 
original morphemes out before assigning 
the POS tag symbols. 
Fig. 1 shows a tag set extracted from 100 full 
POS tag hierarchies in Korean. This tag set will 
be used in our experiments in section 6. 
3 Unknown morpheme guessing 
during morphological analysis 
Morphological analysis is a basic step to natural 
language processing which segments input texts 
into morphotactically connectable morphemes 
and assigns all possible POS tags to each mor- 
pheme by looking up a morpheme dictionary. 
Our morphological analysis follows general 
three steps (Sproat, 1992): morpheme seg- 
mentation, original morpheme recovery from 
spelling changes, and morphotactics modeling. 
Input texts are scanned from left to right., 
character3by character, to be matched to mor- 
phemes in a morpheme dictionary. The mor- 
pheme dictionary (Fig. 2) has a separate en- 
try for each variant form (called allomorph) of 
the original morpheme form so we can easily re- 
construct the original inorphemes from spelling 
changes. 
For morphotactics modeling, we used the POS 
tags and the morphotactic adjacency symbols in 
the dictionary. The full hierarchy of POS tags 
and morphotactic adjacency symbols are en- 
coded in the morpheme dictionary for each mor- 
3The character sequence in "~" is 'u.', ' ~ ', ~-', 
86 
tag descrip t ion' tag 
MC 
MPP 
T 
B 
DI 
I 
js 
eGS 
eCNMM 
eCC 
+ 
SO 
S. 
sf 
common noun 
place name 
pronoun 
adverb 
irregular verb 
i-predicative particle 
auxiliary particle 
description tag 
prefinal ending 
nominal ending 
conjunctive ending 
prefix 
other symbol 
sentence closer 
foreign word 
MPN person name 
MPO other proper noun 
G adnoun 
K interjection 
HR regular adjective 
E existential predicate 
jO other particle 
eCNDI ~aux conj ending 
eCNMG adnominal ending 
y predicative particle 
snfllx 
s' left parenthesis 
s- \] sentence connection 
sh i Chinese character 
MPC 
MD 
S 
DR. 
HI 
jC 
eGE 
eCNDC 
eCNB 
b 
Su 
S ~ 
S, 
description 
country name 
bound noun 
numeral 
regular verb 
irregular adjective 
case particle 
final ending 
quote conj ending 
adverbial ending 
auxiliary verb 
unit symbol 
right parenthesis 
sentence comma 
Figure 1: A tag set with 41 tags from 100 full hierarchical POS tag symbols 
P0S-tag<0r~ginal form> (allomorph) \[morphotactic adjacency symbols\] 
MCC< 7\]-.~> (71---~) \[@>D aI->H e}> DN >\] 
MCK< ~ -~-> (~ @) \[-8->D 8l->\] 
DIS ~< ~L-t 71-> (~t--t 7\]-) \[-~->-~ %~>\] 
DI u < ~oI-.~> ( ~ o\]..~-) \[e>\] 
DI=<~oF~-> (o~ol-R-) \[R->~>\] 
DI*<~ ~> (~) \[R->':q >\] 
~> HI e < 71--~'> (71-~) \[-~ \] 
HI e < 7l-'~> (7}-~) \[@>¢q >\] 
Figure 2: Morpheme dictionary 
pheme. To model the morpheme's connectabil- 
ity to one another, besides the morpheme dictio- 
nary, the separate morpheme-connectivity table 
encodes all the connectable pairs of morpheme 
groups using the morpheme's tag and morpho- 
tactic adjacency symbol patterns. After an in- 
put eojeol is segmented by trie indexed dictio- 
nary search, the morphological analysis checks if 
each segmentation is grammatically connectable 
by looking into the morpheme-connectivity ta- 
ble. 
For unknown morpheme guessing, we develop a 
general unknown morpheme estimation method 
for number-free and position-free unknown mor- 
pheme handling. Using a morpheme pattern 
dictionary, we can look up unknown morphemes 
in the dictionary exactly same way as we do 
the registered morphemes. And when mor- 
phemes are checked if they are connectable, we 
can use the iEformation of the adjacent mor- 
phemes in the same eojeol. The basic idea of the 
morpheme-pattern dictionary is to collect all 
the possible general lexical patterns of Korean 
morphemes and encode each lexical syllable pat- 
tern with all the candidate POS tags. So we can 
assign initial POS tags to each unknown mor- 
pheme by only matching the syllable patterns 
in the pattern dictionary. In this way, we don't 
need a special rule-based unknown morpheme 
handling module in our morphological analyzer, 
and all the possible POS tags for unknown mor- 
phemes can be assigned just like the registered 
morphemes. This method can guess the POS 
of each and every unknown morpheme, if more 
than one unknown morphemes are in an eojeol, 
regardless of their positions since the morpheme 
segmentation is applied to both the unknown 
morphemes and the registered morphemes dur- 
87 
ing the trie indexed dictionary search. 
3.1 Morpheme pattern dictionary 
The morpheme pattern dictionary covers all 
necessary syllable patterns for unknown mor- 
phemes including common nouns, proper- 
nouns, adnominals, adverbs, regular and irreg- 
ular verbs, regular and irregular adjectives, and 
special symbols for foreign words. The lexical. 
patterns for morphemes are collected from the 
previous studies (Kang, 1993) where the con- 
straints of Korean syllable patterns as to the 
morpheme connectabilities are well described. 
Fig. 3 shows some example entries of the mor- 
pheme pattern dictionary, where ;Z', ;V', ;*' are 
meta characters which indicate a consonant, a 
vowel, and any number of Korean characters 
respectively. For example, ~L..7_-l--g\]" (thanks), 
which is a morpheme and an eojeol at the same 
time, is matched "(ZV*N)" (shown in Fig. 3) in 
the morpheme pattern dictionary, and is recov- 
ered into the original morpheme form "2_~". 
4 A hybrid tagging model 
sentence 
i ~ ................ .:.. l ........... ~: ¢lictloOary ::: morphologiea I 
!!i ...... ~' ~mo, p.ho~ = \] i.¢o~,iCth,~,."l ....... .,. ili 
)-. ':.i tta~t...:..::.)t i analyzer i'i 
~. "~::'::':, tram, re .>-:4 
..... t e'~'~': ~J . ~ 
.:;;:::::Tir~ii~/i. :~~:iil ~'~¥6ii=/iinn|i~?"J ...... ~ statistical ,~! " 
~":"'=";~::1 L P°stagger i?,~ 
r : Ir 
(i! .:=,~','~t1~ ":.::.\] .......... 
~::?i: ~°~ :-:.::":J 
post error- lJ 
corrector ~! 
Figure 4: Statistical and rule-based hybrid ar- 
chitecture for Korean POS tagging. 
Fig. 4 shows a proposed hybrid architecture for 
Korean POS tagging with generalized unknown- 
morpheme guessing. There are three ma- 
jor components: the morphological analyzer 
with unknown-morpheme handler, the statisti- 
cal tagger, and the rule-based error corrector. 
The morphological analyzer segments the mor- 
phemes out of eojeols in a sentence and recon- 
structs the original morphemes from spelling 
changes from irregular conjugations. It also as- 
signs all possible POS tags to each morpheme 
by consulting a morpheme dictionary. The 
unknown-morpheme handler integrated into the 
morphological analyzer assigns the POS's of the 
morphemes which are not registered in tim dic- 
tionary. 
The statistical tagger runs the Viterbi algo- 
rithm (Forney, 1973) on the morpheme graph 
for searching the optimal tag sequence for POS 
disambiguation. For remeding the defects of 
a statistical tagger, we introduce a post error- 
correction mechanism. The error-corrector is a 
rule-based transformer (Brill, 1992), and it cor- 
rects the mis-tagged morphemes by considering 
the lexical patterns and the necessary contex- 
tual information. 
4.1 Statistical POS tagger 
Statistical tagging model has the morpheme 
graph as input and selects the best morpheme 
and POS tag sequence r for sentences repre- 
sented ill the graph. The morpheme-graph is 
a compact way of representing nmltiple mor- 
pheme sequences for a sentence. We put each 
morpheme with the tag as a node and the mor- 
pheme connectivity as a link. 
Our statistical tagging model is adjusted from 
standard bi-grams using the Viterbi-search 
(Cutting et al., 1992) plus on-the-fly extra com- 
puting of lexical probabilities for unknown mor- 
phemes. The equation of statistical tagging 
model used is a modified hi-gram model with 
left to right search: 
n )3Pr(tilrni) T ° = argmaz'T I~ aPr(tiIti-t Pr(ti) 
i=1 (1) 
4A Korean eojeol can be segmented into many differ- 
ent ways, so selecting the best morpheme segmentation 
sequence is as important as selecting the best POS se- 
quence in Korean POS tagging. 
88 
POS-tag<original form> (allomorph) \[morphotactic adjacency symbols\] 
HI~<ZV*~> (ZV*~) \[~>o1>\] 
HI e <ZV* ~> (ZV* 71-) \[~>1 
HI ~ <ZV*ZV n > (ZV*q-) \[~>\] 
HI ~ <ZV*ZV ~a > (ZV**.\]) \[@ %'>\] 
HI <ZV*ZV > (ZV* %,>\] 
DIm<ZV* > (ZV* t) 
DI <ZV* > (ZV*N) 
DI= (ZV*G). 
DI=<ZV*~-> (ZV*~) \[-~>,:q >\] 
Figure 3: Morpheme 
where T" is an optimal tag sequence that max- 
imizes the forward Viterbi scores. Pr(tilti-1) 
is a bi-gram tag transition probability and Pr( ti lrni ) 
PT(td is a modified morpheme lexical prob- 
ability. This equation is finally selected from 
the extensive experiments using the following 
six different equations: 
/x 
T" = argmazr ~I Pr(tilti-:)Pr(miltd (2) 
i=1 
T" = argrnaxr rI °LPr(tilti-t)~3Pr(rnil ti) (3) 
i=1 
I,l 
T" = argrnaxr I~ Pr(tilti-1)Pr(tdrni) (4) 
i=1 
n 
T" = argrnaxr ~I °~Pr(tilti-:)13Pr(tilrni) (5) 
• i=1 
n pr(tilrni) T" = argmaxT rI Pr(tilti-l) Pr(ti) (6) 
i=l 
,.2. ,,,Pr(tilmi) T ° = argrnaxT li otz-'r(ti ti-t)p ~ (7) 
i=1 
In the experiments, we used 10204 morpheme 
training corpus from :'Kemong Encyclopedia 5,. 
Table 1 shows the tagging performance of each 
equation. 
Training of the statistical tagging model re- 
quires parameter estimation process for two pa- 
rameters, that is, morpheme lexical probabil- 
ities and bi-gram tag transition probabilities. 
Several studies show that using as much as 
tagged corpora for training gives much better 
Sprovided from ETRI 
pattern dictionary 
performance than unsupervised training using 
Baum-Welch algorithm (blerialdo, 1994). So we 
decided to use supervised training using tagged 
corpora with relative frequency counts. The 
three necessary probabilities can be estimated 
as follows: 
Pr(tilmi) .~ f(tilmi) = N(mi, ti) (8) 
N(t ) Pr(ti) ~ f(ti) = 41 (9) 
Pr(tilti-1) .~ f(tilti-t) = N(ti-:, ti) N(ti-t) (10) 
where N(mi, ti) indicates the total number of 
occurrences of morpheme .mi together with spe- 
cific tag ti, while N(mi) shows the total number 
.of occurrences of morpheme rni in the tagged 
training corpus. The N(ti_l,ti) and N(ti-1) 
can be interpreted similarly for two consecutive 
tags ti-1 and ti. 
4.2 Lexlcal probability estimation for 
unknown morpheme guessing 
The lexical probabilities for unknown mor- 
phemes cannot be pre-calculated using the 
equation (8), so a special method should be 
applied. We suggest to use syllable tri-grams 
since Korean syllables can duly play important 
roles as restricting units for guessing POS of a Pr(tilmi) 
morpheme. So the lexical probability e,-(td 
for unknown morphemes can be estimated us- 
ing the frequency of syllable tri-gram products 
according to the following formula: 
m = ele2...e~ (11) 
89 
equation 2 equation 3 
eojeol 86.80 90.48 
morpheme 91.32 94.93 
eq:uation 4 
89.40 
94.40 
equation 5 equation 6 
89.62 91.73 
94.48 95.77 
equation 7(equation l) 
92.48 
96.12 
Table 1: Tagging performance of each equation. The a and ,3 are weights, and we set a = 0.4 and 
= 0.6. The eojeol shows eojeol-unit tagging correctness while morpheme shows morpheme-unit 
correctness. 
Pr(tlm) 
Pr(t) ~ Prt(ell#'#)Prt(e21#'el) 
n 
\[I Prt(eilei-2, el-t) 
i=3 
Pr(#1e,,-t, en) (12) 
Prt ( eilei-2, ei-t ) .~ .ft(eilei-2, ei-1) 
+.h(eilei- ) 
+ fft(e{) (13) 
where ~m' is a morpheme, 'e' is a syllable, 't' 
is a POS tag, '#" is a morpheme boundary sym- 
bol, and ft(eilei-~., el-l) is a frequency data for 
tag ~t' with cooccurrence syllables el-2, ei-1, ei. 
A tri-gram probabilities are smoothed by. equa- 
tion (13) to cope with the sparse-data problem. 
For example, "~=1-". o ,_ is a name of a person, so 
is an unknown morpheme. The lexical probabil- 
ity of "~'~'" -, o ,~ as tag MPN is estimated using 
the formula: 
Pr(AfPN) Pr.wpN(~rl#, #) 
xPrMpN('~I#, ~) 
x Pr,~teN(~l~ I', ~) 
x PrMpN (#1 
All tri-grams for Korean syllables were pre- 
calculated and stored in the table, and are ap- 
plied with the candidate tags during the un- 
known morpheme POS guessing and smoothing. 
5 A posteriori error correction rules 
The statistical morpheme tagging covers only 
the limited range of contextual information. 
Moreover, it cannot refer to tile lexical pat- 
terns as a context for POS disambiguation. As 
mentioned before, Korean eojeol has very com- 
plex morphological structure so it is necessary 
to look at the functional morphemes selectively 
to get the grammatical relations between eo- 
jeols. For these reasons, we designed error- 
correcting rules for eojeols to compensate es- 
timation and modeling errors of the statisti- 
cal morpheme tagging. However, designing the 
error-correction rules with knowledge engineer- 
ing is tedious and error-prone. Instead, we 
adopted Brill's approach (Brill, 1992) to auto. 
matically learn the error-correcting rules from 
small amount of tagged corpus. Fortunately, 
Brill showed that we don't need a large amount 
of tagged corpus to extract the symbolic tagging 
rules compared with the case in tile statistical 
tagging. Table 2 shows some rule schemata we 
used to extract the error-correcting rules: where 
a rule schema designates the context of rule ap- 
plications, i.e.. the morpheme position and the 
lexical/tag decision in the context eojeol. 
The rules which can be automati- 
cally learned using table 2's schemata 
are in the form of table 3, where 
\[current eojeol or morpheme\] consists of 
morpheme (with current tag) sequence in the 
eojeol, and \[corrected eojeol or morpheme\] 
consists of morpheme (with corrected tag) se- 
quence ill the same eojeol. For example, the rule 
\[N (Chinese ink)/MC + ~-/jS\]\[N1FT, MC\] --~ 
\[N(to eat)/DR + ~-/eCNMG\] says that 
the current eojeol was statistically tagged as 
common-noun (MC) plus auxiliary particle 
(iS), but when the next first eojeol's (N1) 
first position morpheme tag (FT) is another 
common-noun (MC), the eojeol should be 
tagged as regular verb (DR) plus adnominal 
ending (eCNMG). This statistical error is 
caused from the ambiguity of the morpheme 
"N" which has two meanings as "Chinese ink:' 
(noun) and "to eat" (verb). Since the mor- 
pheme segmentation is very difficult ill Korean, 
many of the tagging errors also come from the 
morpheme segmentation errors. Our error- 
correcting rules call cope with these morpheme 
90 
rule schema 
N1FT 
P1LT 
N2FT 
N3FT 
P1LM 
P1FM 
N1FM 
description 
next first eojeol (N1) first morpheme's tag (FT) 
previous first eojeol (P1) last morpheme's tag (LT) 
next second eojeol (N2) first morpheme's tag (FT) 
next third eojeol (N3) first morpheme's tag (FT) 
previous first eojeol (P1) last morpheme's lexical form (LM) 
previous first eojeol (P1) first morpheme's lexical form (FM) 
next first eojeol (N1) first morpheme's lexical form (FM) 
Table 2: Some rule schemata to extract the error-correcting rules automatically from the tagged 
corpus. POSTAG has about 24 rule schemata in this form. 
\[current eojeol or morpheme\] \[rule schemata, referenced morpheme or tag\] \[ 
\[corrected eojeol or morpheme\] I 
Table 3: Error correction rule format 
segmentation errors by correcting the errors in 
the whole eojeol together. For example, the fol- 
lowing rule can correct morpheme segmentation 
errors: \[~/MC 4- ol.7_./jO\]\[P1LM, @\] -~ \[~ 
o\]/DR + _7-~eCCl. This rule says that 
the eojeol "@old" is usually segmented as 
common-noun "~" (meaning string or rope) 
plus other-particle "o\]..v_.", but when the 
morpheme "~" appears before the eojeol, it 
should be segmented as regular-verb ,;@o\] " 
(meaning shrink) plus conjunctive-ending ':2.". 
This kind of segmentation-error correction can 
greatly enhance the tagging performance in 
Korean. The rules are automatically learned by 
Comparing the correctly tagged corpus with the 
outputs of the statistical tagger. The training 
is leveraged (Brill, 1992) so the error-correcting 
rules are gradually learned as the statistical 
tagged texts are corrected by the rules learned 
so far. 
6 Experiment results 
For morphological analysis and POS tagging ex- 
periments, we used 130000 morpheme-balanced 
training corpus for statistical parameter estima- 
tion and 50000 morpheme corpus for learning 
the post error-correction rules. These training 
corpora were collected from various sources such 
as internet documents, encyclopedia, newspa- 
pers, and school textbooks. 
For the test set, we carefully selected three dif- 
ferent document sets aiming for a broad cover- 
age. The document set 1 (25299 morphemes; 
1338 sentences) is collected from :'Kemong en- 
cyclopedia 6,, hotel reservation dialog corpus 7 
and internet document, and contains 10% of 
unknown morphemes. The documents set 2 
(15250 morphemes; 5774 sentences) is solely col- 
lected from various internet documents from 
assorted domains such as broadcasting scripts 
and newspapers, and has about 8.5% of un- 
known morphemes. The document set 3 (20919 
morphemes; 555 sentence) is from Korean stan- 
dard document collection set called KTSET 2.0 s 
and contains academic articles and electronic 
newspapers. This document set contains about 
1470 unknown morphemes (mainly technical jar- 
gons). 
Table 4 showsour taggingperformance for these 
three document sets. This experiment shows 
efficiency of our unknown morpheme handling 
and guessing techniques since we can confirm 
the sharp performance drops between tagger-a 
and tagger-b. The post error correction rules 
are also proved to be effective by the perfor- 
mance drops between the full tagger and tagger- 
a, but the drop rates are mild due to the perfor- 
mance saturation at tagger-a, which means that 
our statistical tagging alone already achieves 
state-of-the-art performance for Korean mor- 
pheme tagging. 
6from ETRI 
:from Sogang University, Seoul. Korea 
Sfrom KT(Korea Telecom) 
91 
document set full tag~ier tagger-a tagger-b tagger-c 
set 1 97.2 96.4 89.5 87.1 
set 2 96.9 96.0 92.8 89.0 
set 3 97.4 96.7 88.7 84.8 
total 97.2 96.4 90.3 I 87.0 
Table 4: Tagging and unknown morpheme guessing performance (all in %). Experiments are 
performed on three different document sets as e~xplained in the text. The full tagger designates our 
POS tagger with all the morphological processing capabilities. The tagger-a is a version without 
employing post error-correction rules. The tagger-b is a more degraded version which does not 
utilize our unknown morpheme guessing capability but treats all unknown morphemes as nouns. 
The tagger-c is an even more deteriorated version which rejects all unknown morphemes as tagging 
failures. The performance drops as we degrade the version from the full tagger. 
7 Conclusion and future works 
This paper presents a pattern-dictionary based 
unknown-morpheme guessing method for a 
statistical/rule-based hybrid tagging system 
which itself exhibits many novel ideas of POS 
tagging such as experiment-based new statisti- 
cal model for Korean, rule based error correc- 
tion and hierarchically expandable tag sets. The 
system POSTAG was developed to test. these 
novel ideas especially for agglutinative lan- 
guages such as Korean. Japanese is also similar 
to Korean in linguistic characteristics and will 
be a good target of these ideas. POSTAG inte- 
grates morphological analysis with generalized 
unknown-morpheme handling so that unknowm- 
morpheme can be processed in the same manner 
as registered morphemes using morpheme pat- 
tern dictionary. POSTAG adopted a hybrid ap- 
proach by cascading statistical tagging to rule- 
based error-correction. Cascaded training was 
implemented to selectively learn statistical tag- 
ging error-correction rules by Brill style trans- 
formation approach. POSTAG also employs hi- 
erarchical tag sets that are flexible enough to 
expand/shrink according to the given applica- 
tions. The hierarchical tag sets can be mapped 
to any other existing tag set as long as they 
axe decently classified, and therefore can en- 
coverage a corpus sharing in Korean tagging 
community. POSTAG is constantly being im- 
proved by expanding the morpheme dictionary, 
pattern-dictionary, and tagged corpus for statis- 
tical training and rule learning. Since general- 
ized unknown-morpheme handing is integrated 
into the system, POSTAG is a good tagger for 
open domain applications such as internet in- 
dexing, filtering, and summarization., and we 
are now developing a web indexer using the 
POSTAG technology. 

References 
E. Brill. 1992. A simple rule-based part-of, 
speech tagger. In Proceedings of the confer- 
ence on applied natural language processing. 
E. Brill. 1994. Some advances in 
transformation-based part-of-speech tag- 
ging. In Proceedings of the AAAI-94. 
D. Cutting, J. Kupiec, J. Pedersem and P. Si- 
bun. 1992. A practical part-of-speech tagger. 
In Proceedings of the conference on applied 
natural language processing. 
G. Forney. 1973. The Viterbi algorithm. Proc. 
of the IEEE, 61:268-278. 
S. S. Kang. 1993. Korean morphological analy- 
sis using syllable information and multiple- 
word units. Ph.D. thesis, Department of 
Computer Engineering, Seoul National Uni- 
veristy. (written in Korean). 
J. Kupiec. 1992. Robust part-of-speech tag- 
ging using a hidden markov model. Computer 
speech and language, 6:225-242. 
B. Merialdo. 1994. Tagging English text with 
a probabilistic model. Computational linguis- 
tics, 20(2):155-171. 
R. Sproat. 1992. Morphology and computation. 
The MIT Press: Cambridge, MA. 
A. Voutilainen. 1995. A syntax-based part-of- 
speech analyzer. In Proceedings of the sev- 
enth conference of the European chapter of 
the association for computational linguistics 
(EACL-95), pages 157-164. 
lq.. Weischedel, M. Meteer, R. Schwartz, 
L. Ramshaw, and J. Palmucci. 1993. Coping 
with ambiguity and unknown words through 
probabilistic model. Computational linguis- 
tics, 19(2):359-382. 
