Part-of-Speech Tagging Considering Surface Form
for an Agglutinative Language
Do-Gil Lee and Hae-Chang Rim
Dept. of Computer Science & Engineering
Korea University
1, 5-ka, Anam-dong, Seongbuk-ku
Seoul 136-701, Korea
{dglee, rim}@nlp.korea.ac.kr
Abstract
Previous probabilistic part-of-speech tagging models for agglutinative languages have considered only the lexical forms of morphemes, not the surface forms of words. This causes an inaccurate calculation of the probability. The proposed model is based on the observation that when different surface forms share the same lexical form, their probabilities of occurrence differ. It is also designed to make use of the lexical forms of words. Experiments show that the proposed model outperforms the bigram Hidden Markov model (HMM)-based tagging model.
1 Introduction
Part-of-speech (POS) tagging is the task of assigning a proper POS tag to each linguistic unit, such as a word, in a given sentence. In English POS tagging, the word is used as the linguistic unit. However, the number of possible words in agglutinative languages such as Korean is almost infinite, because words can be freely formed by gluing morphemes together. Therefore, morpheme-unit tagging is preferred and more suitable for such languages than word-unit tagging. Figure 1 shows an example of the morpheme structure of a sentence, where the bold lines indicate the most likely morpheme-POS sequence. A solid line represents a transition between two morphemes across a word boundary, and a dotted line represents a transition between two morphemes within a word.
Previous probabilistic POS tagging models for agglutinative languages have considered only the lexical forms of morphemes, not the surface forms of words. This causes an inaccurate calculation of the probability. The proposed model is based on the observation that when different surface forms share the same lexical form, their probabilities of occurrence differ. It is also designed to make use of the lexical forms of words. Experiments show that the proposed model outperforms the bigram Hidden Markov model (HMM)-based tagging model.
2 Korean POS tagging model
In this section, we first describe the standard morpheme-unit tagging model and point out a flaw in it. Then, we describe the proposed model.
2.1 Standard morpheme-unit model
This section describes the HMM-based morpheme-unit model. The morpheme-unit POS tagging model is to find the most likely sequence of morphemes $M$ and corresponding POS tags $T$ for a given sentence $S$, as follows (Kim et al., 1998; Lee et al., 2000):

$$\Gamma(S) \stackrel{\mathrm{def}}{=} \arg\max_{M,T} P(M, T \mid S) = \arg\max_{m_{1,u},\,t_{1,u}} P(m_{1,u}, t_{1,u} \mid w_{1,n}) \quad (1)$$

$$\approx \arg\max_{m_{1,u},\,t_{1,u}} P(m_{1,u}, t_{1,u}) \quad (2)$$
In the equations, $u\ (\geq n)$ denotes the number of morphemes in the sentence. The sequence $S = w_{1,n} = w_1 w_2 \cdots w_n$ is a sentence of $n$ words, and the sequences $M = m_{1,u} = m_1 m_2 \cdots m_u$ and $T = t_{1,u} = t_1 t_2 \cdots t_u$ denote a sequence of $u$ lexical forms of morphemes and a sequence of $u$ morpheme categories (POS tags), respectively.
To simplify Equation 2, a Markov assumption is usually adopted as follows:

$$\Gamma(S) \approx \arg\max_{m_{1,u},\,t_{1,u}} \prod_{i=1}^{u} P(t_i \mid t_{i-1}, p)\, P(m_i \mid t_i) \quad (3)$$
where $t_0$ is a pseudo tag which denotes the beginning of a word and is also written as BOS, and $p$ denotes the type of transition from the previous tag to the current tag. It takes a binary value according to the type of the transition (either an intra-word or an inter-word transition).
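For concreteness (this example is not from the paper), the following sketch evaluates the product in Equation 3 for one candidate analysis of the example sentence. The probability tables are invented, and the binary transition type $p$ is encoded as "intra"/"inter".

```python
import math

# Invented probability tables for illustration only.
# trans[(prev_tag, tag, p)] stands in for P(t_i | t_{i-1}, p), where p is
# the transition type ("intra" within a word, "inter" across a boundary).
# emit[(morpheme, tag)] stands in for P(m_i | t_i).
trans = {
    ("BOS", "NNP", "inter"): 0.10,
    ("NNP", "PX", "intra"): 0.40,
    ("PX", "NNC", "inter"): 0.20,
}
emit = {
    ("na", "NNP"): 0.05,
    ("neun", "PX"): 0.30,
    ("hag-gyo", "NNC"): 0.01,
}

def hmm_log_score(analysis):
    """Log of Equation 3's product for one candidate analysis.

    `analysis` is a list of (morpheme, tag, p) triples, where p is
    "inter" if the morpheme starts a new word and "intra" otherwise.
    """
    log_p = 0.0
    prev_tag = "BOS"  # t_0, the pseudo tag
    for morpheme, tag, p in analysis:
        log_p += math.log(trans[(prev_tag, tag, p)])
        log_p += math.log(emit[(morpheme, tag)])
        prev_tag = tag
    return log_p

# First three morphemes of "na-neun hag-gyo-e gan-da" (Figure 1):
print(hmm_log_score([("na", "NNP", "inter"),
                     ("neun", "PX", "intra"),
                     ("hag-gyo", "NNC", "inter")]))
```

Note that the surface form of each word appears nowhere in this score; this is exactly the deficiency discussed next.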
As can be seen, the word(^1) sequence $w_{1,n}$ is discarded in Equation 2, which leads to an inaccurate calculation of the probability.

(^1) A word is a surface form.

[Figure 1: Morpheme structure of the sentence "na-neun hag-gyo-e gan-da" (I go to school). The lattice contains candidate morpheme/POS nodes na/NNP, na/VV, na/VX, nal/VV, neun/PX, neun/EFD, hag-gyo/NNC, e/PA, ga/VV, ga/VX, gal/VV, n-da/EFF, and n-da/EFC between BOS and EOS; the bold lines indicate the most likely morpheme-POS sequence.]
A lexical form of a word can be mapped to more than one surface form; although different surface forms are given, if they share the same lexical form, the model assigns them the same probability. For example, the lexical form mong-go/nc+leul/jc(^2) can be derived from two surface forms, mong-gol and mong-go-leul. By applying Equation 1 and Equation 2 to both words, the following equations can be derived:

$$P(\textit{mong-go}/nc, \textit{leul}/jc \mid \textit{mong-gol}) \approx P(\textit{mong-go}/nc, \textit{leul}/jc) \quad (4)$$

$$P(\textit{mong-go}/nc, \textit{leul}/jc \mid \textit{mong-go-leul}) \approx P(\textit{mong-go}/nc, \textit{leul}/jc) \quad (5)$$

As a result, we can obtain the following equation from Equation 4 and Equation 5:

$$P(\textit{mong-go}/nc, \textit{leul}/jc \mid \textit{mong-gol}) = P(\textit{mong-go}/nc, \textit{leul}/jc \mid \textit{mong-go-leul}) \quad (6)$$

That is, these models assume that tagging results sharing the same lexical form are equally probable. However, we can easily show that Equation 6 is mistaken. The surface form mong-go-leul has only one possible analysis, so $P(\textit{mong-go}/nc, \textit{leul}/jc \mid \textit{mong-go-leul}) = 1$; on the other hand, mong-gol also admits the analysis mong-gol/nc, so $P(\textit{mong-gol}/nc \mid \textit{mong-gol}) \neq 0$. Hence, $P(\textit{mong-go}/nc, \textit{leul}/jc \mid \textit{mong-gol}) < P(\textit{mong-go}/nc, \textit{leul}/jc \mid \textit{mong-go-leul})$.

To overcome this disadvantage, we propose a new tagging model that can consider the surface form.
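To make the counting argument concrete, here is a small numeric illustration with invented corpus counts: conditioning on the ambiguous surface form mong-gol spreads probability mass over two analyses, while mong-go-leul gives all of its mass to a single one.

```python
from collections import Counter

# Invented (surface form, analysis) counts, for illustration only.
counts = Counter({
    ("mong-gol", "mong-go/nc+leul/jc"): 30,
    ("mong-gol", "mong-gol/nc"): 70,
    ("mong-go-leul", "mong-go/nc+leul/jc"): 40,
})

def p_analysis_given_surface(analysis, surface):
    """Relative-frequency estimate of P(analysis | surface)."""
    total = sum(c for (s, _), c in counts.items() if s == surface)
    return counts[(surface, analysis)] / total

# P(mong-go/nc+leul/jc | mong-go-leul) = 1.0
print(p_analysis_given_surface("mong-go/nc+leul/jc", "mong-go-leul"))
# P(mong-go/nc+leul/jc | mong-gol) = 0.3 < 1.0, so Equation 6 fails
print(p_analysis_given_surface("mong-go/nc+leul/jc", "mong-gol"))
```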
2.2 The proposed model
This section describes the proposed model. To simplify the notation, we introduce a variable $R$, which denotes a tagging result of a given sentence and consists of $M$ and $T$:

$$\Gamma(S) \stackrel{\mathrm{def}}{=} \arg\max_{M,T} P(M, T \mid S) \quad (7)$$

$$= \arg\max_{R} P(R \mid S) \quad (8)$$

(^2) mong-go means Mongolia, nc is a common noun, and jc is an objective case postposition.
The probability $P(R \mid S)$ is given as follows:

$$P(R \mid S) = P(r_{1,n} \mid w_{1,n}) \quad (9)$$

$$= \prod_{i=1}^{n} P(r_i \mid w_{1,n}, r_{1,i-1}) \quad (10)$$

$$\approx \prod_{i=1}^{n} P(r_i \mid w_i, r_{i-1}) \quad (11)$$

where $r_i$ denotes the tagging result of the $i$th word ($w_i$), and $r_0$ denotes a pseudo variable indicating the beginning of the sentence. Equation 9 becomes Equation 10 by the chain rule. To obtain a more tractable form, Equation 10 is simplified into Equation 11 by a Markov assumption.
The probability $P(r_i \mid w_i, r_{i-1})$ cannot be calculated directly, so it is derived as follows:

$$P(r_i \mid w_i, r_{i-1}) = \frac{P(w_i, r_{i-1}, r_i)}{P(w_i, r_{i-1})} \quad (12)$$

$$\approx \frac{P(w_i)\, P(r_i \mid w_i)\, P(r_{i-1} \mid r_i)}{P(w_i)\, P(r_{i-1})} \quad (13)$$

$$= \frac{P(r_i \mid w_i)\, P(r_{i-1} \mid r_i)}{P(r_{i-1})} \quad (14)$$

$$= P(r_i \mid w_i)\, \frac{P(r_{i-1}, r_i)}{P(r_{i-1})\, P(r_i)} \quad (15)$$

Equation 12 is derived by Bayes' rule, Equation 13 by the chain rule and an independence assumption, and Equation 15 by Bayes' rule. In Equation 15, we call the left term the "morphological analysis model" and the right term the "transition model".
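As a side note on the right term of Equation 15: it rewards pairs of adjacent tagging results that co-occur more often than chance. A toy numeric check with an invented joint distribution:

```python
# Invented joint distribution over adjacent tagging results (r_prev, r_cur).
# The transition model P(r_prev, r_cur) / (P(r_prev) P(r_cur)) is the
# exponential of pointwise mutual information: > 1 means the pair
# co-occurs more often than chance, < 1 less often.
joint = {("A", "A"): 0.3, ("A", "B"): 0.1,
         ("B", "A"): 0.1, ("B", "B"): 0.5}

def transition_model(r_prev, r_cur):
    p_prev = sum(p for (a, _), p in joint.items() if a == r_prev)
    p_cur = sum(p for (_, b), p in joint.items() if b == r_cur)
    return joint[(r_prev, r_cur)] / (p_prev * p_cur)

print(transition_model("B", "B"))  # about 1.39: attraction
print(transition_model("A", "B"))  # about 0.42: repulsion
```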
The morphological analysis model $P(r_i \mid w_i)$ can be implemented in a morphological analyzer. If the morphological analyzer can provide this probability, the tagger can use its values as they are. In practice, we use the probabilities produced by the morphological analyzer ProKOMA (Lee and Rim, 2004). Although it is not necessary to discuss the morphological analysis model in detail, we should note that surface forms are considered here.
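A tagger could consume the analyzer's output as sketched below; the analyze() interface and its candidate probabilities are hypothetical stand-ins, not ProKOMA's actual API.

```python
# Hypothetical analyzer interface: for a surface word, return candidate
# tagging results r with P(r | w) attached. Values are invented.
def analyze(word):
    if word == "mong-gol":
        return [("mong-gol/nc", 0.7), ("mong-go/nc+leul/jc", 0.3)]
    return [(word + "/nc", 1.0)]

for result, prob in analyze("mong-gol"):
    print(result, prob)  # P(r_i | w_i), the morphological analysis model
```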
The transition model is a form of pointwise mutual information:

$$\frac{P(r_{i-1}, r_i)}{P(r_{i-1})\, P(r_i)} = \frac{P(M_{i-1}, T_{i-1}, M_i, T_i)}{P(M_{i-1}, T_{i-1})\, P(M_i, T_i)} \quad (16)$$

$$= \frac{P(m^{i-1}_{1,b}, t^{i-1}_{1,b}, m^{i}_{1,c}, t^{i}_{1,c})}{P(m^{i-1}_{1,b}, t^{i-1}_{1,b})\, P(m^{i}_{1,c}, t^{i}_{1,c})} \quad (17)$$

where the superscript $i$ in $m^{i}_{1,c}$ and $t^{i}_{1,c}$ denotes the position of the word in the sentence, $b$ is the number of morphemes in the $(i-1)$th word, and $c$ is the number of morphemes in the $i$th word.
The denominator is the joint probability that the morphemes and tags within a word appear together, and the numerator is the joint probability that all the morphemes and tags across the two words appear together. Due to the data sparseness problem, these also cannot be calculated directly from the training data. By a Markov assumption, the denominator and the numerator can be broken down into Equation 18 and Equation 19, respectively.
$$P(m_{1,c}, t_{1,c}) = \prod_{l=1}^{c} P(t_l \mid t_{l-1})\, P(m_l \mid t_l) \quad (18)$$

$$P(m^{i-1}_{1,b}, t^{i-1}_{1,b}, m^{i}_{1,c}, t^{i}_{1,c}) = \left[ \prod_{l=1}^{b} P(t^{i-1}_{l} \mid t^{i-1}_{l-1})\, P(m^{i-1}_{l} \mid t^{i-1}_{l}) \right] \times P_{\mathrm{inter}}(t^{i}_{1} \mid t^{i-1}_{b}) \times P(m^{i}_{1} \mid t^{i}_{1}) \times \prod_{l=2}^{c} \left[ P(t^{i}_{l} \mid t^{i}_{l-1})\, P(m^{i}_{l} \mid t^{i}_{l}) \right] \quad (19)$$

where $P_{\mathrm{inter}}(t^{i}_{1} \mid t^{i-1}_{b})$ denotes the transition probability between the last morpheme of the $(i-1)$th word and the first morpheme of the $i$th word.
By applying Equation 18 and Equation 19 to Equation 17, all the intra-word factors cancel, and we obtain the following equation:

$$\frac{P(r_{i-1}, r_i)}{P(r_{i-1})\, P(r_i)} = \frac{P_{\mathrm{inter}}(t^{i}_{1} \mid t^{i-1}_{b})}{P(t^{i}_{1} \mid \mathrm{BOS})} \quad (20)$$
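Putting Equations 11, 15, and 20 together, the search can be run word by word. The following sketch is one possible realization, not the paper's implementation: the analyzer is assumed to return each candidate's first tag, last tag, and P(r_i | w_i), and the two probability lookups are assumed helpers.

```python
import math

def viterbi_words(words, analyze, p_inter, p_tag_given_bos):
    """Word-level Viterbi search for Equation 11.

    `analyze(word)` yields (first_tag, last_tag, p_r_given_w) triples;
    per Equations 15 and 20, each candidate is scored by
    log P(r_i | w_i) + log P_inter(t_1^i | t_b^{i-1}) - log P(t_1^i | BOS).
    """
    best = [(0.0, [])]   # (log score, path) for each previous candidate
    prev_last = ["BOS"]  # last tag of each previous candidate
    for word in words:
        new_best, new_last = [], []
        for first_tag, last_tag, p_r in analyze(word):
            scored = [
                (score
                 + math.log(p_r)
                 + math.log(p_inter(last, first_tag))
                 - math.log(p_tag_given_bos(first_tag)),
                 path + [(word, first_tag, last_tag)])
                for (score, path), last in zip(best, prev_last)
            ]
            new_best.append(max(scored, key=lambda sp: sp[0]))
            new_last.append(last_tag)
        best, prev_last = new_best, new_last
    return max(best, key=lambda sp: sp[0])[1]
```

For the first word, the correction term cancels whenever $P_{\mathrm{inter}}$ from BOS coincides with $P(t \mid \mathrm{BOS})$, so the score reduces to the morphological analysis model alone.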
For a given sentence, Figure 2 shows the bigram HMM-based tagging model, and Figure 3 the proposed model. The main difference between the two models is that the proposed model considers surface forms while the HMM does not.
3 Experiments
For evaluation, two data sets are used: the ETRI POS tagged corpus and the KAIST POS tagged corpus. We divided each corpus into ten parts, and the performance of each model is measured by averaging over the ten test sets in a 10-fold cross-validation experiment. Table 1 summarizes the corpora.
Table 1: Summary of the data

Corpus                 ETRI     KAIST
Total # of words       288,291  175,468
Total # of sentences   27,855   16,193
# of tags              27       54
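The 10-fold setup described above can be reproduced with a split along the following lines (corpus loading is left abstract):

```python
def ten_fold_splits(sentences):
    """Yield (train, test) pairs for 10-fold cross-validation:
    each tenth of the corpus serves exactly once as the test set."""
    k = 10
    folds = [sentences[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test
```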
Generally, POS tagging goes through the following steps: first, a morphological analyzer generates all the possible interpretations for a given input text; then, a POS tagger takes these results as input and chooses the most likely one among them. Therefore, the performance of the tagger depends on that of the preceding morphological analyzer.
If the morphological analyzer does not generate the correct result, the tagger has no chance to select it; thus, the answer inclusion rate of the morphological analyzer is an upper bound on the tagger's accuracy. Previous works preprocessed the dictionary so that all the correct answers were included in the morphological analyzer's results. However, strictly speaking, this evaluation method is inappropriate for real applications. In this experiment, we report the accuracy of the morphological analyzer instead of preprocessing the dictionary. ProKOMA's results on the test data are listed in Table 2.
Table 2: Morphological analyzer's results on the test data

Corpus                           ETRI   KAIST
Answer inclusion rate (%)        95.82  95.95
Average # of results per word    2.16   1.81
1-best accuracy (%)              88.31  90.12
In the table, 1-best accuracy is defined as the proportion of words in the test data whose highest-probability result matches the gold standard. This can also be viewed as a tagging model that does not consider any surrounding context.
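Under that definition, 1-best accuracy can be computed as in this sketch, reusing the hypothetical analyze() interface from Section 2.2 that returns (result, probability) pairs:

```python
def one_best_accuracy(words, gold_results, analyze):
    """Fraction of words whose highest-probability analysis matches
    the gold standard (a tagging baseline that ignores context)."""
    correct = 0
    for word, gold in zip(words, gold_results):
        best = max(analyze(word), key=lambda rp: rp[1])[0]
        if best == gold:
            correct += 1
    return correct / len(words)
```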
To compare the proposed model with the standard model, the results of the two models are given in Table 3. As can be seen, our model outperforms the HMM model. Moreover, the HMM model is even worse than ProKOMA's 1-best accuracy, which indicates that the standard HMM by itself is not a good model for agglutinative languages.
Table 3: Tagging accuracies (%) of the standard HMM and the proposed model

Corpus               ETRI   KAIST
The standard HMM     87.47  89.83
The proposed model   90.66  92.01

[Figure 2: Lattice of the bigram HMM-based model: morpheme-unit nodes na/NNP, neun/PX, hag-gyo/NNC, e/PA, ga/VV, and n-da/EFF between BOS and EOS.]

[Figure 3: Lattice of the proposed model: word-unit nodes na/NNP+neun/PX, hag-gyo/NNC+e/PA, and ga/VV+n-da/EFF between BOS and EOS, aligned with the surface words "na-neun hag-gyo-e gan-da".]

4 Conclusion

We have presented a new POS tagging model for Korean, an agglutinative language, that can consider the surface forms of words. Although the model leaves much room for improvement, it outperforms the HMM-based model according to the experimental results.
Acknowledgement
This work was supported by a Korea Research Foundation Grant (KRF-2003-041-D20485).
References
J.-D. Kim, S.-Z. Lee, and H.-C. Rim. 1998. A morpheme-unit POS tagging model considering word-spacing. In Proceedings of the 1998 Conference on Hangul and Korean Information Processing, pages 3-8.

D.-G. Lee and H.-C. Rim. 2004. ProKOMA: A probabilistic Korean morphological analyzer. Technical Report KU-NLP-04-01, Department of Computer Science and Engineering, Korea University.

S.-Z. Lee, Jun'ichi Tsujii, and H.-C. Rim. 2000. Hidden Markov model-based Korean part-of-speech tagging considering high agglutinativity, word-spacing, and lexical correlativity. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics.
