KCAT : A Korean Corpus Annotating Tool Minimizing Human 
Intervention 
Won-He Ryu, Jin-Dong Kim, line-Chang Rim 
Dept. of Computer Science & Engineering, 
Natural Language Processing Lab, 
Korea University 
Anam-dong 5-ga, Seongbuk-gu, Seoul, Korea 
whryu, jin, rim @nlp.korea.ac.kr 
Abstract 
While large POS(part-of-speech) annotated 
corpora play an important role in natural 
 processing, the annotated corpus 
requires very high accuracy and consistency. 
To build such an accurate and consistent 
corpus, we often use a manual tagging 
method. But the manual tagging is very 
labor intensive and expensive. Furthernaore, 
it is not easy to get consistent results from 
the humari experts. In this paper, we present 
an efficient tool lbr building large accurate 
and consistent corpora with minimal human 
labor. The proposed tool supports semi- 
automatic tagging. Using disambiguation 
rules acquired from human experts, it 
minimizes the human intervention in both 
the manual tagging and post-editing steps. 
1. Introduction 
The POS annotated corpora are very 
important as a resource of usefiil information tbr 
natural  processing. A problem for 
corpus annotation is tile trade-off between 
efficiency and accuracy. 
Although manual POS ta,,<,in,,== = is very 
reliable, it is labor intcnsive and hard to make a 
consistent POS tagged corpus. On the other hand, 
automatic ta,-,in,,>~ > is prone to erroi-s Ibr 
infrequently occurring words duo to tile lack el" 
overall linguistic information. At present, it is 
ahnost impossible to construct a highly accurate 
corptls by usin<,~ an automatic taggcr~ alone. 
/ks a consequence, a semi-autonmtic ta,,,,in,~== 
method is proposed IBi corpus annotation. In 
Heui-.Seok Lira 
Information Communications Department, 
Natural Language Processing Lab, 
Chonan University 
85-1, Anseo-Dong, Chonan City, 
ChungChong-NamDo Province, Korea 
timhs@inli~com.chonan.ac.kr 
ordiriary semi-automatic tagging, an automatic 
tagger tags each word and human experts correct 
the rots-tagged words in the post-editing step. 
But, in the post-editing step, as the human expert 
cannot know which word has been annotated 
incorrectly, he must check every word in the 
whole corpus. And he lnust do the same work 
again and again for the same words in the same 
context. This situation causes as Inuch 
labor-intensive work as in manual ta<+<qlw 
In this paper, we propose a semi-automatic 
tagging method that can reduce the human labor 
and guarantee the consistent tagging. 
2o System Requivemer~ts 
To develop ari efficient tool that attempts to 
build a large accurately armotated corpus with 
minimal human labor~ we must consider the 
following requirements: 
® In order to minimize human labor, the same 
human intervention to tag and to correct the 
same word in tile same context should not be 
repeated. 
* There may be a word which was tagged 
inconsistently in the same context becatlse it 
was tagged by different human experts or at a 
different task time. As an elticient tool, it can 
prevent tile inconsistency of tile annotated 
(I results and ~uarantec the consistency of the 
annotated results. 
* It must provide an effective annotating 
capability lbr many unknown words in the 
whole corpus. 
1096 
3. Proposed POS Tagging ToohKCAT 
The proposed POG tagging tool is used to 
combine the manual tagging method and the 
automatic tagging method. They are integrated 
to increase the accuracy o\[" the automatic tagging 
method and to minimize the amount of tile 
human labor of thc manual tagging method. 
Figure 1 shows the overall architecture of the 
proposed tagging tool :KCAT. 
I ......... I I I ~ I P I Raw (..rpus ILI 
Post-FnJcess ~t Ire-lrocess 
('cJrrect an ~ ........... :~ ; .... ,R s " " .......... 
--7--~ i --g 
~____~ " : . 
i ~'f::: 2 aa' :ii,:n~ ...... 
Figure 1. System Architecture of KCAT 
As shown in figm'e 1, KCAT consists of 
three modules: the pre-processing module, the 
automatic tagging module, and the 
post-processing module. In the prcoprocessing 
module, the disambiguation rules are acquired 
I%m human experts. The candidate words are 
Ihe target words whose disambiguation rules are 
acquired. The candidate words can be unknown 
words and also very frequent words. In addition, 
the words with problematic ambiguity for tlle 
automatic tagger can become candidates. 
l)lsamblguation rules are acquired with minimal 
human labor using tile tool t:n'oposed in 
(Lee, 1996). In the automatic tagging naodule, the 
disambiguation rules resolve the ambiguity of 
{,'very word to which they can be applied. 
I lowever, tile rules are certainly not sufficient to 
resolve all the ambiguity of the whole words in 
file corpus. The proper tags are assigned to the 
remaining ambiguous words by a stochastic 
< t~" c, hLllllan lagger. After the automatic t,t~m~, a 
expert corrects tile onors o\[ the stochastic ta,me, 
The system presents the expert with the results 
of the stochastic tagger. If the result is incorrect, 
tile hulllan expel1 corrects the error and 
generates a disambiguation rule ~br the word. 
The rule is also saved in the role base in order to 
bc used later. 
3. I. l.exical Rules for Disambiguation 
There are many ambiguous words that are 
extremely difficult to resolve alnbiguities by 
using a stochastic tagger. Due to the problematic 
words, manual tagging and manual correction 
must be done to build a correct coqms. Such 
human intervention may be repeated again and 
again to tag or to correct tile same word in the 
same context. 
For example, a human expert should assign 
'Nal(flying)/Verb+Neun/Ending' to every 
'NaNemf repeatedly in the following sentences: 
" Keu-Nyeo-Neun Ha-Neul-Eul Na-Neun 
Pi-Haeng-Ki-Reul Port Ceok-i Iss-Ta." (she has 
seen a flying plane) 
"Keu-Netm lht-Nc'ul-Eul NaoNeun 
t'i-Itaeng--Ki-Reul Port Ceok-i Eops-Ta." (he has 
never seen a flying phme) 
"Keu-Netm tta-Ne,tl-Eul Na-Neun 
Pi--ttaeng--Ki-Reul Pal-Myeong-tlaess- Ta." (he 
invented a flying plane) 
In the above sentences, human experts can 
resolve the word, 'Na-Nemf with only the 
previous and ttle next lexical information: 
'fla-Neul-Eul' and 'Pi-tlaeng- Ki-Reul'. In other 
words, tile human expert has to waste time on 
tagging the same word in tile same context 
repeatedly. This inefficiency can also be 
happened in the manual correction of the 
ntis-tagged words. So, if the human expert can 
make a rule with his disambiguation knowledge 
and use it for tile same words in tile same 
context, such inefficiency can be minimized. We 
define the disambiguation rule as a lexical rule. 
Its template is as follows. 
\[P:N\] \[Current Word\] \[Context\] = \[Tagging 
P, esuh\] 
Context • Previous words°p * Next Words°,, 
Ill tile above template, p and n mean tile 
previous and the next context size respectively. 
For the present, p and n are limited to 3. '*' 
1097 
represents the separating mark between the 
previous and next context. For example, tile rule 
\[1:1\] \[Na-,'\:lten\] \[Ha-Neul-Eld * Pi-Haeng-Ki- 
Reul\] = \[Na/(flying)/Verb i- Neun/Ending \] says 
the tag 'Nal(flying)/Verb + Neun/Ending' should 
be assigned to the word 'Na-Neun' when the 
previous word and the next word is 
'Ha-Neul-Eul' and 'Pi-Haeng-Ki-Reul'. 
Although these lexical rules cannot always 
correctly disambiguate all Korean words, they 
are enough to cover many problematic 
ambignous words. We can gain some advantages 
of using the lexical rule. First, it is very accurate 
because it refers to the very specific lexical 
information. Second, the possibility of rule 
conflict is very little even though the number of 
the rules is increased. Third, it can resolve 
problematic ambiguity that cannot be resolved 
without semantic inf'onnation(Lim, 1996). 
3.2. Lexicai Rule Acquisition 
Lexical rules are acquired for the unknown 
words and the problematic words that are likely 
to be tagged erroneously by an automatic tagger. 
Lexical rule acquisition is perlbrmed by 
following steps: 
1. The system builds a candidate list of 
words li)r which the lexical rules would be 
acquired. The candidate list is the collection 
of all examples of unknown words and 
problematic words for an automatic tagger. 
2. A human expert selects a word from the 
list and makes a lexical rule for the word. 
3. The system applies tile lexical rule to all 
examples of the selected word with same 
context and also saves the lexical rule in the 
rule base. 
4. P, epeat tile steps 2 and 3 until all 
examples of the candidate words can be 
tagged by the acquired lexical rules. 
3.3. Automatic Ta,,,in,, 
In the automatic ta,,~dn-oo ~ phase, words are 
disambiguated by using the lexical rules and a 
stochastic tagger. To armotate a word in a raw 
corpus, the rule-based tagger first searches the 
lexical rule base to find a lexical rule that can be 
nlatched with tile given context. If a matching 
rnle is found, the system assigns the result of the 
rule to the word. According to the corresponding 
rule, a proper tag is assigned to a word. With tile 
lexical rules~ a very precise tag can be assigned 
to a word. However, because the lexical rules do 
not resolve all the ambiguity of the whole corpus, 
we must make use of a stochastic tagger. We 
employ an HMM--based POS tagger for this 
purpose(Kim,1998). The stochastic tagger 
assigns the proper tags to the ambiguous words 
afier the rule application. 
Alter disambiguating the raw corpus using 
the lexical rules and the atttomatic tagger, we 
arrive at the frilly disambiguated result. But the 
word tagged by the stochastic tagger may have a 
chance to be mis-tagged. Therefore, the 
post-processing for error correction is required 
for the words tagged by the stochastic tagger. 
3.4. Error Correction 
The human expert carries out the error 
correction task for the words tagged by a 
stochastic tagger. This error correction also 
requires tile repeatecl human labor as in the 
manual tagging. We employ the similar way of 
the rule acquisition to reduce the human labor 
needed for manual error cmTection. The results 
of the automatic tagger are marked to be 
distinguished from tile results of the rule-based 
tagger. The human expert checks the marked 
words only. If an error is found, the ht/man 
expert assigns a correct tag to the word. When 
tile expert corrects the erroneous word, tile 
system automatically generates a lexicat rule and 
stores it in tile rnle base. File newly acquired 
rule is autoinatically applied to the rest of tile 
corpus. Thus, the expert does not need to correct 
the repeated errors. 
1098 
B ........ 
A ...J 
:~,£ "~; ~ ~'J ~'Y,I .:'l~ll,k! G'~(~ ~:)':'fl,q !!!~l))L",l')l ,q';'.%ll.q !ll~ ~. "?1 )~:d~ 
:':} 'k L' ~i~ tl ? r31 ,31 ?2 :~ '2'.' :~ ,:,i\[ .~ YZ. "±! :'J q l :'112j X,"~ ?t ) I -@ ~! ? I 
".'.20 t ~'~tJ 2: .c Ul I '3t3b!:! I ~ :~ '~ (IM{3tl *,1 N ~ :31 ,'q ~£i ::i ;' £,,3 ~ :~ ~J g~ "JH G 
r.NwO}.lx* I '¢v¢et~a5 :!~31 W~'gffll Y'~a! tlt~0l adlTll',lLr'9~. 
r ,~ wU,t.l:<t E : '¢t:S eft ~1:" 
E(';'.hTF ~ I '~ t 
,iH,'.t CH.q',~'~ NcrsicTc~ 
,:'I~L.IOF',I~qEA ~x~.'qlOL }~M~?-"--'?II~ "~. 
=i~I 'q 
:" Gt~i!~} 5"3~d~/t,\]HP*~:}IJ'2 
J, kt21 ~t X}el/tlt'l!3 • N,,'JF O :" h!T;UIHOII !,~T¢~I'I/f'g'dG.OII/JC 
:. *.;9 E ,~;.,V~.,*°.L.'EP* /EF- GF 
~.~n ~ 7;/I,iN p. ~,,d X 
:, ~la~ At ~I"~INhII\]-.Lt/JFB 
> @~(~,~ .. @~iNNP-(/SS,~'I/SH. 
g~ ~ 11 t ,'(I ~J L~ .~,!/N N P • Ilt Xl ~I/JK G 
> ZI~01 M XI~/NN,-~*OtlM/Ji:B 
£H ;~°I 01 ~/NN,5. ~I,/.JK6 
G' Xtl ~ 7? .~tl / r,~ N 6 • ~ ,/o K o "-._" 
> ~'\[£8101 gd,l> ~t' *~alDN G-0~/J > ~.~. $tlVV*92/EP-~niEF - zSP 
i'~ 92XI g, ol S'!gMtaNG.XI%VNN,3.°I/. 
> ~Xl2J~ ~XI?J/NN?'~IJF.O 
~ )I a~ IJ}:?_k ~ )I/NNG -8 }/XSV-OHjLnt ; ~.E}. 8t/VV*?)\[EP.E~/EF*/SF 
\] ~,J '_'6&}; .3} .~4j~q <?Jr 't ~r~?j ~.II?41~.L ~> k12I > 3~t?ItG'°3 )ldlt ~"/q'41'lG.?l ', KG 
......... 21 x oH,~t'~011:,11 ~,,ki~_~ ~HYj9 :;'.3o x ~,-a4 ~,i1,,'i~bJ 3*/SP • 
.... 2\[_ '. f~l_ ___-~-~A Xll, )lge. 131Ct1~21 2N.~gJ ~,{~0"IIM)~- ............. I .......... ;111~ J 
::,~2!;,7~,-~ -~.,~,,~.~. ........................................ ~,~_ ...... 
Figure 2. Building Annotated Corpus Usiug KCAT 
4. Application to Build Large Corpora 
Based on the proposed method~ we have 
imrdemented, a corpus--annotating tool for 
Koreart which is named as KCAT(Korean 
Corpus Annotating 'Fool). The process of 
building large corpora with KCAT is as lbllows: 
1. The lexical roles in the rule base are 
applied to a raw corpu::,. If the rule base i!; 
empty, nothing will be done. 
2. The sy,~;tem makes a candidate li';t. 
3. Ilunmn expert produces the lexical 1.ules 
for the words in the candidate list. 
4. The .~;ystem tags the corpus by using the 
lexical rHles and a stochastic t,l~,~.c~. 
5. Hunmn manually con°cots errors caused by 
the stochastic tagger, and lexical rules for 
those errors are also stored in the 
role--base. 
6. For other corpus, repeat the steps 1 
through 5. 
Figure 2 shows a screenshot of KCAT. In this 
figure, "A' window represents the list of raw 
corpus arm a "B' window contains the contcnt of 
the selected raw corpus in the window A. The 
tagging result is displayed in the window 'C'. 
Words beginning with ">' are tagged by a 
stocha,,;tic la-<,e, and the other words are ta~Eed 
by lexical rules. 
We can -et the more lexical rules as the 
ta,,,,itw process is prom-esscd. Therefore, we can 
expect that the aecunu-y and the reduction rate 
C 
of human htbor are increased a~ long as the 
tagging process is corltilmed. 
5. Experimental Results 
In order to estimate tim experimental results 
of our system, we collected the highly 
ambiguous words and frequently occurring 
words in our test corpus with 50,004 words. 
\]able I shows reductions in human intervention 
required to armotate the raw coums when we use 
lexical rules lbr the highly ambiguous words and 
the frequently occurring words respectively. The 
second colurnn shows that we examined the 
4,081 OCCLirrences of 2,088 words with tag 
choices above 7 and produced 4,081 lexical 
rules covering 4,832 occurrences of the corpl_lS. 
In this case, the reduction rate of human 
intervention is 1.5%. ~ The third column shows 
that we exalnined thc 6,845 occurrences of 511 
words with ficqucncy above 10 and produced 
6,845 lexical rules covering 15,4 l 8 occurrences 
of the corpus. In tiffs case, the reduction rate of 
human intervention is 17%. 2 
The last row in the table shows how 
intbrnmtive the rules are. We measured it by the 
inq-~iovement rate of stochastic tagging ;_!.l'l.el- the 
rules arc applied. From these experimental 
result.~;, wc can judge that rule-acquisition from 
flcquelatly occurring words is preferable. 
i (4,~., _4,(),v; l ) / 50,004 
~. ( 15,41 x-6,g~b ) / 50,004 
1099 
Table 1. Reduction in human Intervention 
I Type of word 
lbr rule 
acquisition 
Number of 
words 
Ambiguous 
words (_>7) 
Frequently 
occurring 
words (_>10) 
4832(9.6°/,,) 15418(30%) 
Number of 408 l 6845 
lexical rules 
Decrement of 1.5% 17% 
h u lll a 11 
intervention 
hnprovement 1.6% 3.7% 
of tagging 
accttracy (94.1-92.5%) (95.2-92.5%) 
Table 2 shows the results of our experiments on 
tile applicability of lexical rules. We measure it 
by the improyement rate of stochastic tagging 
alter the rules acquired from other corpus are 
applied. 
The third row shows that we annotate a training 
corpus with 10,032 words and produce 631 
lexieal rules, which can be applied to another 
test corpus to reduce tile number of the 
stochastic ta-,,in,, errors frorn 697 to 623. 3 
The ~brth and fifth row show that as the number 
of lexical rules is increased, the number of the 
errors of the tagger is decreased on the test 
corpus. 
These experilnental results demonstrate tile 
promise of gradual decrement of human 
intervention and improvement of tagging 
accuracy in annotating corpora. 
Table 2. Applicability of Lexical Rules 
Size of tile The nunaber The number of 
corpus of lexical stochastic 
roles errors 
0 0 697 
10,032 631 62.3 
20,047 136l 565 
_~( ,049 2091 538 
6. Conclusion 
The main goal of our work is to dcvelop an 
efficiclat tool which supports to build a very 
3 Our test corpus includes 10,015 words 
accurately and consistently POS annotated 
corpus with nlinilnal hunmn labor. To achieve 
the goal, we have proposed a POS ta,,-in- tool 
named KCAT which can use human linguistic 
knowledge as a lexical rule form. Once a lexical 
role is acquired, the hutnan expert doesn't need 
to spend titne in tagging the same word in the 
same context. By using the lexical roles, we 
could have very accurate and consistent results 
as well its reducing the amount of the hurnan 
labor. 
It is obvious that the more lexical roles the 
tool acquires the higher accuracy and 
consistency it achieves. But it still requires a lot 
of human labor and cost to acquire many lexical 
rules. And, as the number of the lexical rules is 
increased, the speed of rule application is 
decreased. To overcome the barriers, we try to 
find a way of rule generalization and a more 
efficient way of rule encoding scheme like the 
finite-state atttomata(Roche, 1995). 
Furthermore, we will use the distance of the 
best and second tag's probabilities to classify 
reliable automatic tagging result and unreliable 
ta,,,,in,, result(Brants, 1999). 

References 

Brants~ T. Skut, W. and Uszkoreit, H. (1999) 
A),ntac'tic /hmotatio/1 of a German N~.-*lri'Spal)e\]" 
Coums. In "Jourrlees ATALA", pp.69°76. 

Kim, J. D. Lira, H. S. and Rim, H. C. (1998) 
Morl)henle-Unit POS Tagging Mode/ 
Considering Eojeol-Spacing. In "Proc. of the 
10th ttangul and Korean Information 
Processing Conference", pp.3-8. 

Lee, J. K. (1996) Eojeol-tmit rule Based POS 
tag~in£~ with minimal human intervention. M. 
S dissertation, Dept. of Computer Science and 
Engineering, Korea Univ. 

Lira, H. S. Kim, J. D. and Rim, H. C. (1996) .4 
Korean 1)'an.@)rmation-I~axed POS Tagger 
with Lexical h!fi)rmation ojmi.vtag,4ed Eojeo\[. 
In "Proc. of the 2nd Korea-China Joint 
Symposium on Oriental Language 
Computing", pp. 119-124. 

Roche, E. and Schabes, Y. (1995)Determini.s'tic 
Part-o.f-St)eect~ Taggi/Ig with Fi//te-State 
7?aHsduc'er. Computational Linguistics, 21/2, 
pp. 227-253.
