Korean Language Engineering: Current Status of the 
Information Platform * 
Kim, Seongyong and Choi, Key-Sun 
Department of Computer Science
Korea Advanced Institute of Science and Technology
Taejon, Korea 
sykim@csking.kaist.ac.kr, kschoi@world.kaist.ac.kr
Abstract 
Language engineering implements functions of a language and
information via computers. The need for language engineering
platforms has been generally recognized, and several research
efforts are being undertaken around the world. Our goal is to
establish a Korean information platform of linguistic resources
and tools for the Korean language and information communities.
The platform will support researchers and engineers with
well-developed and standardized resources and application tools,
thereby avoiding duplicate activities from scratch and amplifying
the overall effort in the domain. This paper reports the
components and the current status of the project, and the
importance of the effort.
1 Korean Language Engineering 
1.1 Language Engineering 
Language engineering is an activity that implements various
functions related to a language and builds up an information
base. It realizes the linguistic activities of everyday life and
the linguistic competence of human beings with the aid of
computer science, thereby supporting people's intellectual
linguistic productions. Language engineering not only collects
and disseminates the information and knowledge of a language
among the linguistic society but also serves as a foundation on
which linguistic culture and technologies can be based
(Oh et al., 1994).
1.2 Korean Language Engineering 
Korean language engineering is language engineering for the
Korean language. It was born in the early 1980s with the
emergence of personal computers (PCs). In the beginning, work
focused on the Korean alphabet and some fragmentary parts of
character processing, lacking a global view of engineering
approaches. Technical approaches to Korean began with the
formation of the special interest group on Korean information
processing under the Korea Information Science Society. In 1994
the Center for Korean Language Engineering (KLE) was founded to
serve as a central organization for Korean language engineering,
which aims to plan and program related projects and works in a
consistent, systematic way with long-term goals. It also brings
academic and research institutes and industries together around
common goals: an efficient and harmonious drive toward research
and development, and the establishment of long-range policies
and strategies for Korean language engineering.

* This work is funded by the Ministry of Science and Technology
and the Ministry of Education and Athletics, as a part of a
contract by the Center for Korean Language Engineering.
2 Areas of Korean Language 
Engineering Researches 
According to the level of technologies, KLE partitioned its
projects into three classes.
Fundamental technology deals with basic and theoretical
research, collection and manipulation of data, and
standardization. From the linguistic viewpoint, these include
language formalisms, text corpora, and statistical information
of a language. On the information engineering side, the
technology covers information interchange and compression
techniques, basic techniques of artificial intelligence such as
knowledge representation and searching, and tools for
manipulating the Korean alphabet. From the cognitive engineering
point of view, the research focuses on the structure of the
Korean alphabet, fonts, command structures, and
interdisciplinary works of cognitive science. Another division
handles standardization issues for code schemes and
vocabularies, keyboard layout, standard text formats, and
internationalization.
The second class is called basic technology, which is related to
the basic software libraries for Korean language processing.
Included in this class are natural language analysis, pattern
recognition, multimedia data base, and data conversion tools.
The third class is applications technology. It consists of
systems for text interchange and compression, hypertext,
multimedia, word processing and others. For knowledge
processing, it will cover document paraphrasing, indexing and
retrieval, computer-based instruction/education, etc.

Figure 1: The Conceptual Diagram of the Information Platform
(Legend: U: Unix, S: Solaris, W: Windows; developing for Unix
first, then for S and W.)
3 Information Platform 
For Korean language engineering, it is necessary to develop all
the projects of each area systematically and integrate them into
a uniform frame, called an information platform (IP).¹ KLE
programs each project according to its priority and the state of
the art of the technology. Consequently, IP reflects the status
of ongoing projects and is an as-is framework on which further
research and development work can be performed.
Figure 1 shows the conceptual diagram of IP. The platform does
not yet integrate all the project outcomes, only some of the
fundamental resources and basic tools, since it reflects a
current configuration that is not fixed but open to changes. The
full integration of the project outcomes will be available at
the end of the first phase in 1997.
This platform is different from ALEP (Advanced Language
Engineering Platform) (Simpkins, 1994) in that ALEP is an
environment that can be provided to users in the form of a
(customizable) package, whereas our platform follows a
server-client model in pursuit of a web-based service for
resources and tools.
The World Wide Web is composed of hyperdocuments and hyperlinks,
which handle multimedia data and provide easy and timely access
to electronic information. It uses the Hypertext Markup Language
(HTML), which is based on the Standard Generalized Markup
Language (SGML). It therefore guarantees standardization and
straightforward design characteristics, which lead to ease of
system design and flexibility of system configurations
(Berners-Lee & Connolly, 1993). Another characteristic is the
Common Gateway Interface (CGI), which makes it possible to
interface with various shell scripts and program codes without
difficulty. Yet another point is that the server-client model
makes the platform transparent to the users.

¹ The version 1 platform at "http://world.kaist.ac.kr/KLE/KIBS/"
runs on SunOS, and its web pages are only in Korean. The second
version will be released on Solaris at the address
"http://kibs.kaist.ac.kr/KLE/KIBS/".
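
As a rough illustration of the CGI mechanism described above, the
following Python sketch shows how a language tool might be exposed
through an http server. It is only a sketch under stated
assumptions: the request parameter name "text", the function names
handle_request and toy_tool, and the toy tool itself are
hypothetical and are not taken from the platform.

```python
import os
from urllib.parse import parse_qs


def toy_tool(text: str) -> str:
    """Hypothetical stand-in for a language tool.

    It only splits the input into whitespace-delimited word phrases,
    one per line; a real tool would be far more elaborate.
    """
    return "\n".join(text.split())


def handle_request(query_string: str) -> str:
    """Build a CGI response (header block, blank line, body)."""
    params = parse_qs(query_string)
    # "text" is an assumed parameter name for this sketch.
    text = params.get("text", [""])[0]
    body = toy_tool(text)
    return "Content-Type: text/plain; charset=utf-8\r\n\r\n" + body


if __name__ == "__main__":
    # A CGI-capable server passes the query string in the environment.
    print(handle_request(os.environ.get("QUERY_STRING", "")), end="")
```

A request such as "?text=hakgyoe+ganda" would be answered with the
two word phrases on separate lines. The point of the design is
that the tool stays an ordinary program while the server handles
access for the users.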
IP consists of three parts. First, text corpora, voice and
handwritten script DBs, dictionaries and a set of terminological
DBs constitute the information base. The information base may be
distributed directly through the ftp server or accessed
indirectly by the language tools on the higher layer of the http
server configuration.
Secondly, language tools run on the http server with the aid of
CGI and are also distributed to users by ftp as executable
codes. Since we aim to provide software versions on Unix,
Solaris, and PC Windows altogether, initial hardware
requirements for each tool may be different.²
Finally, documentation will be prepared as the project
progresses.
4 Information Base 
4.1 Text Corpus 
Text corpora are essential for statistical modeling, developing
formal theories of grammar, investigating prosodic phenomena in
speech, and evaluating or comparing the adequacy of parsing
models (Marcus et al., 1993). There are four sorts of corpora
drawn from contemporary Korean texts.
• Raw corpus
Two factors are considered: the genre of each source text, which
relates to the objective(s) in using the corpus, and the
category of the text, which represents its internal structure.
Major sources of the corpus include books, magazines, and
newspapers; to date, three million word phrases have been
gathered.
• Part-of-speech (POS) tagged corpus
The POS tagset for Korean originated from (Kim & Seo, 1994). The
version 1 platform provides 2.5 million automatically tagged
word phrases and 1.5 million post-edited word phrases.
• Tree-tagged corpus
This can be produced by applying a syntactic tagset to the POS
tagged corpus. The syntactic tagset is being studied using
100,000 sentences out of the POS tagged corpus, and the
resultant tree-tagged corpus, produced with a tree tagger, will
appear at the end of this year.
• Categorized corpus
Korean verbs and adjectives are classified into over seventy
categories, and a set of sentence styles is investigated for 940
basic verbs of those categories. About thirty-five thousand
sentences are available in the version 1 platform.

² For example, the text and dictionary management system is
currently being built on PC Windows, so Unix and Solaris
executables are not yet available.
4.2 Voice Data Base 
This resource can be used for speech recognition and synthesis
applications. We initially focused on word-level voice data. It
includes phonetically balanced words, phonemic sequences
pronounced by four different speakers, and narration of sample
stories. It also stores the sounds of single syllables,
diphones, numerics, high-frequency words, gazetteers, functional
words, and consecutive word sequences. The data are stored on
server disks and CD-ROMs as waveforms. This effort will be
extended to sentence-level collections such as phonetically
balanced sentences, speech dialogues, and scenarios.
4.3 Handwritten Scripts Data Base 
Since character recognition systems are under the control of
applications engineers, the objective of this work is to provide
well-formed data and evaluation criteria for those recognition
systems. We divide our data collection into three phases: to
scan, at 300 dpi resolution, one thousand sets of 590
high-frequency syllables in the first year, then of 990
syllables and 2,350 syllables in the following years.³ At each
phase, we develop both square-hand characters and free-style
characters.
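
The figure of 11,172 composable syllables mentioned in footnote 3
follows from combining 19 initial consonants, 21 vowels, and 28
final positions (27 final consonants plus the empty final). The
sketch below checks this arithmetic and composes syllables by
index. It assumes the layout of the Unicode Hangul Syllables block
(starting at U+AC00), which is used here only for illustration and
is not the KSC-5601 code table discussed in this paper; the
function names are hypothetical.

```python
# Assumption: Unicode Hangul Syllables layout, not the KSC-5601 table.
N_INITIAL = 19   # initial consonants
N_MEDIAL = 21    # vowels
N_FINAL = 28     # 27 final consonants plus "no final"
HANGUL_BASE = 0xAC00


def syllable_count() -> int:
    """Number of syllables composable from the Korean alphabet."""
    return N_INITIAL * N_MEDIAL * N_FINAL


def compose(initial: int, medial: int, final: int = 0) -> str:
    """Compose one syllable from zero-based jamo indices."""
    if not (0 <= initial < N_INITIAL
            and 0 <= medial < N_MEDIAL
            and 0 <= final < N_FINAL):
        raise ValueError("jamo index out of range")
    offset = (initial * N_MEDIAL + medial) * N_FINAL + final
    return chr(HANGUL_BASE + offset)
```

Here syllable_count() gives 19 x 21 x 28 = 11,172, of which the
2,350 syllables collected in the third phase are the subset
prescribed by the standard code.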
³ It is possible to compose up to 11,172 syllables from the
letters of the Korean alphabet, but Korean Standard Code
KSC-5601 prescribes 2,350 complete codes for Korean syllables.

4.4 Dictionaries and Terminological Data Base
• Multilingual technical dictionary
The objective is to set up mappings between technical terms of
Korean and other language(s) in both directions. The first work
is done for the computer science domain, and it has 35,000
entries each for Korean and English. It will be extended to
cover Chinese, Japanese, and German as well as more domains
including electrical/electronic engineering, medical science,
law, etc.
• Monolingual terminology data bank
Users need definitions and explanations of technical terms
during their work on specific domains. This work provides users
with such terminological details. We have compiled 15,000
entries each for culture/art and Korean classical literature.
• Ontology-based lexicon
Currently available dictionaries are semantically oriented. They
don't provide pools of target language expressions but offer
basic meanings for entries together with some syntactic and
morphological information. An ontology-based lexicon is
lexically oriented in that it guides the user to find a
pragmatically or contextually equivalent expression
corresponding to the source language expression. The work is in
the phase of a feasibility study, with intensive focus on
collecting Korean-English bilingual information sources and
developing tools for lexicon construction.
• Lexicon for morphological analysis
The lexicon for Korean morphological analysis is currently being
built to have 30,000 entries with off-line management tools, and
will grow to 100,000 entries with on-line tools after two more
years.⁴
5 Language Engineering Tools 
The tools presented here are for text corpora and dictionaries.
Voice and character recognizers are not yet included; these two
programs are currently under development and will be integrated
later.
5.1 Morphological Analyzer 
Morphological analysis is an important but difficult part of the
analysis, since Korean is an agglutinative language with
sophisticated morpheme segmentation rules and morphotactic
rules. The morphological analyzer is based on Korean chart
parsing (Lee, 1993). Its current precision is over 92 percent
for grammatical input sentences. It aims to achieve 98 percent
accuracy in two more years. It will be extended to cover special
symbols, alien strings, elliptical or abbreviated words, and
spelling errors to earn higher accuracy.
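
To make the idea of chart-based segmentation with morphotactic
constraints concrete, the following Python sketch enumerates the
analyses of one word phrase. It is a toy illustration, not the
analyzer of (Lee, 1993): the romanized lexicon, the tag names, and
the connectivity table are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

Analysis = List[Tuple[str, str]]  # sequence of (morpheme, tag)

# Hypothetical toy lexicon (romanized): surface form -> possible tags.
LEXICON: Dict[str, Set[str]] = {
    "hakgyo": {"N"},   # noun 'school'
    "e": {"P"},        # locative postposition
    "ga": {"V", "P"},  # verb stem 'go' or subject postposition
    "nda": {"E"},      # verbal ending
}
# Hypothetical morphotactic table: (left tag, right tag) pairs allowed.
ALLOWED: Set[Tuple[str, str]] = {("N", "P"), ("V", "E")}
START_TAGS = {"N", "V"}       # tags that may begin a word phrase
END_TAGS = {"N", "P", "E"}    # tags that may end a word phrase


def analyze(word: str) -> List[Analysis]:
    """Return every segmentation of `word` licensed by the tables."""
    n = len(word)
    # chart[i] holds all partial analyses covering word[:i].
    chart: List[List[Analysis]] = [[] for _ in range(n + 1)]
    chart[0] = [[]]
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            for tag in sorted(LEXICON.get(piece, ())):
                for partial in chart[start]:
                    if partial:
                        ok = (partial[-1][1], tag) in ALLOWED
                    else:
                        ok = tag in START_TAGS
                    if ok:
                        chart[end].append(partial + [(piece, tag)])
    return [a for a in chart[n] if a and a[-1][1] in END_TAGS]
```

With these tables, analyze("hakgyoga") keeps only the reading in
which "ga" is a postposition, because a verb stem may not follow a
noun. Unknown strings yield an empty list, which is where handling
of special symbols or spelling errors would have to come in.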
5.2 Tagger 
Because the output of morphological analysis is rather complex
due to the characteristics of Korean, the use of a tagger to
reduce ambiguities is important for further processing. The
tagger of (Shin et al., 1995) adopts the hidden Markov model and
takes into account the characteristics of Korean word phrase
structures for more accurate tagging: a word phrase contains one
or more morphemes, syntactic information (grammatical relations
by bound morphemes), and semantic information (case roles by
postpositions). The experiments showed 98 % accuracy for a test
set of 5,500 word phrases drawn from the 55,000 word phrases of
training data, and 94.7 % for 5,500 untrained test data.
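
The decoding step of a hidden Markov model tagger can be sketched
with the Viterbi algorithm, as below. This is a generic
illustration and not the model of (Shin et al., 1995): the two
word-phrase tags, the romanized word phrases, and all probability
values are invented for the example, and the input sequence is
assumed to be non-empty.

```python
import math
from typing import Dict, List, Optional, Tuple


def _log(p: float) -> float:
    """Logarithm that maps probability 0 to minus infinity."""
    return math.log(p) if p > 0.0 else float("-inf")


def viterbi(obs: List[str],
            states: List[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the most probable tag sequence for `obs`."""
    # best[t][s] = (log-probability of best path ending in s, backpointer)
    best: List[Dict[str, Tuple[float, Optional[str]]]] = [
        {s: (_log(start_p[s]) + _log(emit_p[s].get(obs[0], 0.0)), None)
         for s in states}
    ]
    for t in range(1, len(obs)):
        row: Dict[str, Tuple[float, Optional[str]]] = {}
        for s in states:
            prev = max(states,
                       key=lambda p: best[t - 1][p][0] + _log(trans_p[p][s]))
            score = best[t - 1][prev][0] + _log(trans_p[prev][s])
            row[s] = (score + _log(emit_p[s].get(obs[t], 0.0)), prev)
        best.append(row)
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return path[::-1]


# Invented toy model: NP = noun+postposition, VE = verb+ending.
STATES = ["NP", "VE"]
START = {"NP": 0.8, "VE": 0.2}
TRANS = {"NP": {"NP": 0.4, "VE": 0.6}, "VE": {"NP": 0.7, "VE": 0.3}}
EMIT = {"NP": {"naneun": 0.3, "hakgyoe": 0.7},
        "VE": {"naneun": 0.2, "ganda": 0.8}}
```

In this toy model the word phrase "naneun" is ambiguous between
the two tags, and the context decides: viterbi(["naneun",
"hakgyoe", "ganda"], STATES, START, TRANS, EMIT) selects NP for it.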
⁴ We can conceive of many more types of dictionaries: for
example, lexicons for syntactic and semantic analyses, and
dictionaries that are to be created or extracted from existing
ones upon users' or developers' needs. These will be included
after the first phase of the project, following the future
direction of the project.
Another approach is based on the Markov ran- 
dom field (MRF) theory (Jung, 1996), whose Ko- 
rean version will be added to IP this year. 
5.3 Tree Tagger 
The tree tagger of (Kim, 1995) is a prototype that uses
dependency grammar and adopts statistical methods for ranking
the parse trees to get k-best parsing results. Its current
accuracy is about 80 % for the trained data. While this is a
working prototype, we need a tree tagger with better
performance, so another tree tagger using a partial parsing
method (Abney, 1991) is at the breadboard stage.
5.4 Korean/English Alignment System 
An alignment system gathers correspondences between surface
representations of both languages. (Shin, 1996) experimented
with the expectation-maximization algorithm and obtained 68.7 %
accuracy at the phrase level; this will be incorporated into the
version 2 platform.
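
How expectation-maximization can recover correspondences from
parallel text may be sketched as follows. The code estimates
word-level translation probabilities in the style of a simple
lexical translation model; this is an assumed, simplified stand-in
and not the phrase-level method of (Shin, 1996). The romanized toy
corpus and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[List[str], List[str]]  # (Korean words, English words)


def train_lexical_em(pairs: List[Pair],
                     iterations: int = 10) -> Dict[Tuple[str, str], float]:
    """Estimate t(english | korean) by expectation-maximization."""
    tgt_vocab = {e for _, tgt in pairs for e in tgt}
    uniform = 1.0 / len(tgt_vocab)
    t: Dict[Tuple[str, str], float] = defaultdict(lambda: uniform)
    for _ in range(iterations):
        count: Dict[Tuple[str, str], float] = defaultdict(float)
        total: Dict[str, float] = defaultdict(float)
        # E-step: distribute each English word over the Korean words.
        for src, tgt in pairs:
            for e in tgt:
                z = sum(t[(e, f)] for f in src)
                for f in src:
                    c = t[(e, f)] / z
                    count[(e, f)] += c
                    total[f] += c
        # M-step: renormalize the expected counts.
        t = defaultdict(float,
                        {(e, f): count[(e, f)] / total[f]
                         for (e, f) in count})
    return dict(t)


# Hypothetical toy parallel corpus (romanized Korean, English).
TOY_CORPUS: List[Pair] = [
    (["jip"], ["house"]),
    (["keun", "jip"], ["big", "house"]),
]
```

On the toy corpus, the first sentence pair ties "jip" to "house",
and over the iterations this pushes "keun" toward "big" in the
second pair, even though no pair shows "keun" in isolation.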
5.5 KWIC Manager 
The keyword-in-context (KWIC) manager deals with word usage in a
text corpus. Its functions include indexing and searching word
phrases, morphemes or unigrams, applying logic operations (AND,
OR, NOT) to them, and sorting the results.
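
The functions listed above can be sketched in a few lines of
Python: an inverted index over word phrases, set operations for
AND, OR, and NOT, and a sorted keyword-in-context listing. This is
a minimal sketch, not the KWIC manager itself; the function names
and the romanized sample sentences are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Index = Dict[str, Set[int]]


def build_index(sentences: List[str]) -> Index:
    """Map each word phrase to the ids of sentences containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for sent_id, sentence in enumerate(sentences):
        for phrase in sentence.split():
            index[phrase].add(sent_id)
    return dict(index)


def search_and(index: Index, a: str, b: str) -> Set[int]:
    return index.get(a, set()) & index.get(b, set())


def search_or(index: Index, a: str, b: str) -> Set[int]:
    return index.get(a, set()) | index.get(b, set())


def search_not(index: Index, a: str, n_sentences: int) -> Set[int]:
    return set(range(n_sentences)) - index.get(a, set())


def kwic(sentences: List[str], keyword: str,
         width: int = 2) -> List[Tuple[str, str, str]]:
    """List (left context, keyword, right context), sorted by right context."""
    rows = []
    for sentence in sentences:
        phrases = sentence.split()
        for i, phrase in enumerate(phrases):
            if phrase == keyword:
                left = " ".join(phrases[max(0, i - width):i])
                right = " ".join(phrases[i + 1:i + 1 + width])
                rows.append((left, phrase, right))
    return sorted(rows, key=lambda row: row[2])


# Hypothetical sample corpus (romanized).
SAMPLE = ["naneun hakgyoe ganda", "geuneun hakgyoe itda", "naneun jibe ganda"]
```

For instance, search_and(build_index(SAMPLE), "naneun", "hakgyoe")
returns the id of the first sentence only. Sorting by the right
context is one of several plausible sort keys; sorting by the left
context would group usages differently.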
5.6 Text/Dictionary Management 
System 
TDMS's goals are twofold: to provide customizable information
extraction/indexing/search tools and managerial functions for
the text data base; and to provide an environment for dictionary
development and management as well as for converting or merging
existing dictionaries into the intended one according to the
user's specification.
Because of the large size of each text to be stored and the many
keywords to be indexed and searched for each text, TDMS requires
special storing and managing mechanisms. This is also the case
for dictionary management. For extensibility and adaptability,
we have devised a standard dictionary markup language based on
SGML. Templates (dictionary features, text descriptors, and
relations among those) and specifications for the
text/dictionary editor and format translator have also been
designed, and low-level design is being undertaken. This work is
being coded on PC Windows and will output the first draft
version this year.
6 Conclusion 
We have described the motivation and current status of the
Korean IP and taken a brief look at its resources and tools. We
started the project in 1994, yielded the version 1 platform in
1995, and are working on the version 2 platform. The project
will continue into the twenty-first century. Although the
current status is just a starting point, the long-term goal is
to build fully automatic servers for Korean language
information. Since IP plays a key role in the effort, we hope
that our endeavors will be well geared to the needs of
nation-wide language engineering.
References 
Abney, Steven. 1991. Parsing by Chunks. In Berwick, R.,
Abney, S., and Tenny, C. (eds.), Principle-Based Parsing.
Kluwer Academic Publishers.
Berners-Lee, Tim, and Connolly, Daniel. 1993. Hypertext Markup
Language: A Representation of Textual Information and
Metainformation for Retrieval and Interchange. CERN, USA.
Jung, Sung-Young. 1996. A Markov Random Field based English
Part-of-Speech Tagging System. M. S. Thesis, Korea Advanced
Institute of Science and Technology. (to appear in COLING96.)
Kim, Hiongun. 1995. Korean Syntactic Analysis with Probabilistic
Dependency Grammar. M. S. Thesis, Korea Advanced Institute of
Science and Technology.
Kim, Jae-Hoon, and Seo, Jungyun. 1994. A Korean Part-of-Speech
Tag Set for Natural Language Processing. Technical report
no. CAIR-TR-94-55. KAIST: Center for Artificial Intelligence
Research.
Lee, Eun-Chul. 1993. An Improved Method on Korean Morphological
Analysis Based on CYK Algorithm. M. S. Thesis, Pohang Institute
of Science and Technology.
Marcus, Mitchell P., Santorini, Beatrice, and Marcinkiewicz,
Mary A. 1993. Building a Large Annotated Corpus of English: The
Penn Treebank. Computational Linguistics, 19(2): 313-330.
Oh, Gil-Rok, Choi, Key-Sun, and Park, Se-Young. 1994. Hangul
Engineering. Seoul, Korea: Daeyoungsa.
Shin, Jung-Ho, Han, Young-Seok, Park, Young-Chan, and Choi,
Key-Sun. 1995. An HMM Part-of-Speech Tagger for Korean Based on
Word-phrase. Recent Advances in Natural Language Processing,
Bulgaria.
Shin, Jung-Ho. 1996. Aligning a Parallel Korean-English Corpus
at Word and Phrase Level. M. S. Thesis, Korea Advanced Institute
of Science and Technology. (to appear in COLING96.)
Simpkins, N. K. 1994. ALEP (Advanced Language Engineering
Platform): An Open Architecture for Language Engineering. CEC
and Cray Systems, Luxembourg.
