Automatic English-to-Korean Text Translation of Telegraphic
Messages in a Limited Domain

Clifford Weinstein
Lincoln Laboratory, MIT
Lexington, MA 02173
USA
cjw@sst.ll.mit.edu

Dinesh Tummala
Lincoln Laboratory, MIT
Lexington, MA 02173
USA
tummala@sst.ll.mit.edu

Young-Suk Lee
Lincoln Laboratory, MIT
Lexington, MA 02173
USA
ysl@sst.ll.mit.edu

Stephanie Seneff
MIT LCS, SLS
Cambridge, MA 02139
USA
seneff@lcs.mit.edu
Abstract

This paper describes our work-in-progress in automatic English-to-Korean text translation. This work is an initial step toward the ultimate goal of text and speech translation for enhanced multilingual and multinational operations. For this purpose, we have adopted an interlingua approach with natural language understanding (TINA) and generation (GENESIS) modules at the core. We tackle the ambiguity problem by incorporating syntactic and semantic categories in the analysis grammar. Our system is capable of producing accurate translations of complex sentences (38 words) and sentence fragments as well as average-length (12 words) grammatical sentences. Two types of system evaluation have been carried out: one for grammar coverage and the other for overall performance. For system robustness, integration of two subsystems is under way: (i) a rule-based part-of-speech tagger to handle unknown words/constructions, and (ii) a word-for-word translator to handle other system failures.
1 Introduction

The overall goal of our translation work is automatic text and speech translation for limited-domain multilingual applications.1 The primary target application is enhanced communication among military forces in a multilingual coalition environment, where translation utilizes a Common Coalition Language as a military interlingua. Our development effort was initiated with a speech-to-speech translation system, called CCLINC (Tummala et al., 1995), which consists of a modular, multilingual structure including speech recognition, language understanding, language generation, and speech synthesis in each language. The system architecture of CCLINC is given in Figure 1. Note that the system design provides for verification of the system's understanding of each utterance to the originator, in a paraphrase in the originator's language, before transmission on the coalition network.

1 This work was sponsored by the Defense Advanced Research Projects Agency. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Air Force.

2 We are also working on Korean-to-English text translation in the same domain, which we do not include in this paper.

This paper describes our current work in automatic English-to-Korean text translation of telegraphic military messages,2 which is an initial step toward the ultimate goal
of producing high quality text/speech translation output.3 The core of our text translation system consists of an analysis module and a generation module. The analysis module produces a semantic frame, which is an interlingua representation of the input sentence. The intractable ambiguities of natural language are overcome by restricting the domain and the grammar rules, which specify the semantic co-occurrence restrictions of head categories. The structural difference between the source (English) and the target (Korean) language is easily captured by the flexible interlingua representation and the strictly modularized target language grammar template, external to the core generation system. The simplicity of the system enables us to detect problems and provide solutions easily. Currently the system has a vocabulary of 1427 words. The system runs on a SPARC 10 workstation. The Korean translation outputs are displayed on a hangul window running on UNIX. In addition, we are in the process of porting the system to a Pentium laptop running on Linux.
This paper is organized as follows: In Section 2 we describe our system architecture, along with the grammar rules which drive the core system. In Section 3 we summarize the characteristics of our source language text, comprised of naval operational messages. In Section 4 we give our system evaluation. In Section 5 we discuss the integration of two subsystems for system robustness: a rule-based part-of-speech tagger to handle unknown words/constructions, and a word-for-word translator to produce partial translations in the event of system failure. Finally, we summarize the paper in Section 6.
2 System Description

The core of our translation system consists of two modules: the understanding/analysis module, TINA, and the generation module, GENESIS.4 These modules are driven by a set of files which specify the source and target language grammars. The process flow of our text translation system is given in Figure 2.
2.1 Language Understanding

The language understanding system, TINA, described at length in (Seneff, 1992), integrates key ideas from context-free grammar, augmented transition network, and unification concepts. With the context-free grammar rules of English as input, the system produces the parse tree of an input sentence. The parse tree is then mapped onto a semantic frame, which plays the role of an interlingua. The parse tree and the semantic frame of the input sentence "0819 z uss sterett

3 Refer to (Kim, 1994) for other ongoing efforts in English/Korean text translation including (Choi, 1994). See (Lee, 1995) for speech translation work with Korean as the source language.

4 Both modules are developed under ARPA sponsorship by the Spoken Language Systems Group at the MIT Laboratory for Computer Science.
[Figure: block diagram in which English/CCL, Korean/CCL, and French/CCL translation systems, each containing understanding (semantic frame) and confirmation components, communicate over a Common Coalition Language network linked to a command & control database]

Figure 1: System structure for multilingual speech-to-speech translation
[Figure: language understanding (TINA - MIT/LCS), driven by the MUC-II analysis grammar for English, feeding language generation (GENESIS - MIT/LCS), driven by the MUC-II generation grammar for Korean]

Figure 2: Process Flow of Text Translation
taken under fire by kirov with ssn-12's" are given in Figure 3 
and Figure 4, respectively. 
As is reflected in the parse tree, both syntactic and seman- 
tic categories are utilized in our grammar specification. Top 
level categories such as 'sentence,' 'subject,' etc. are syntax- 
based, whereas lower-level categories such as 'ship_name,' 
'time_expression,' etc. are semantics-based. The main ad- 
vantage of adopting semantic categories is that we can easily 
specify the co-occurrence restrictions of head categories (e.g., 
the parse tree specifies that the category ships occurs with 
a small subset of nominal modifiers including uss which we 
call 'ship_mod'), and therefore reduces the ambiguity of the 
input sentence. In addition, it provides for easy access to the 
meaning of domain-specific expressions. The parse tree di- 
rectly encodes the knowledge that sterett and kirov are ship 
names, ssn-12 a submarine name, and z stands for Green- 
wich Mean Time. 
As for mapping from parse tree to semantic frame, we re- 
duce all the major parse tree constituents into one of the 
three syntactic roles, i.e. clause, topic, and predicate. All 
clause-level categories including statements, infinitives, etc., 
are mapped onto clause. All noun phrases are mapped onto 
topic, and all modifiers as well as verb phrases are mapped 
onto predicate. However, there is no limit to the number of 
semantic frame categories, and we can easily create new cat- 
egories for a more elaborate representation. In Figure 4, we 
have additional categories like 'time_expression.' Whether or 
not we add more categories to the semantic frame depends 
on how elaborate a translation output is desired. If elaborate 
translations are required, we increase the number of semantic 
frame categories.

Table 1: Sample English/Korean Bilingual Lexicon

be            V    ROOT i / PRES i / PAST iess
V2            V    ING goiss / PP ess / PRES n / PAST ess
cause_en      V2   cholaytoy
visually      AV   sikakulo
cap_aircraft  N    centhwu cengchalki

The flexibility of the semantic frame representation makes the TINA system an ideal tool for machine
translation of various (i) purposes (i.e., whether a detailed or 
rough translation is required), and (ii) languages (e.g., some 
languages require a more elaborate tense representation or 
honorification than others, and the appropriate categories 
can be easily added.) 
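To make the clause/topic/predicate reduction concrete, here is a rough sketch in Python. This is our own illustration with invented category names, not TINA's actual implementation:

```python
# Hypothetical sketch of the parse-tree-to-semantic-frame reduction
# described above; category names and structures are illustrative.

CLAUSE_CATS = {"statement", "infinitive"}       # clause-level categories
TOPIC_CATS = {"subject", "ships", "ship_name"}  # noun-phrase categories
# everything else (modifiers, verb phrases) maps to "predicate"

def to_frame(node):
    """Reduce a parse-tree node (category, children) to a nested frame."""
    cat, children = node
    if cat in CLAUSE_CATS:
        role = "clause"
    elif cat in TOPIC_CATS:
        role = "topic"
    else:
        role = "predicate"
    if isinstance(children, str):               # leaf: surface word
        return {role: cat, "word": children}
    return {role: cat, "children": [to_frame(c) for c in children]}

tree = ("statement",
        [("subject", [("ship_name", "sterett")]),
         ("vp_taken_under_fire", [("v_by", [("ship_name", "kirov")])])])
frame = to_frame(tree)
```

New frame categories such as 'time_expression' would correspond to extra roles added alongside the three defaults.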
2.2 Language Generation 
The language generation system, GENESIS (Glass, Polifroni 
and Seneff, 1994), produces target language output on the 
basis of the semantic frame representation. It is driven by 
three submodules: a lexicon, a set of message templates, and 
a set of rewrite rules. These modules are language-specific 
and external to the core generation system. Consequently, 
porting the generation system to a new language is confined 
to developing these submodules.5
2.2.1 Lexicon 
Since the semantic frame uses English as its specification 
language, and is the basis for constructing the target lan- 
guage grammar and lexicon, entries in the lexicon contain 
words and concepts found in the semantic frame, expressed 
in English, with corresponding surface realization forms in 
Korean. A sample fragment of a bilingual lexicon is given in 
Table 1. 
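For illustration, such lexicon entries can be thought of as a mapping from interlingua concepts (in English) to romanized Korean surface forms with inflectional variants. This is a sketch of the structure only; the actual GENESIS lexicon file format is not shown here:

```python
# Illustrative bilingual lexicon structure (assumed, not GENESIS's format).
# Entries follow Table 1; romanizations are as given there.
lexicon = {
    "cause_en": {"pos": "V2", "surface": "cholaytoy"},
    "visually": {"pos": "AV", "surface": "sikakulo"},
    "cap_aircraft": {"pos": "N", "surface": "centhwu cengchalki"},
    "be": {"pos": "V", "forms": {"ROOT": "i", "PRES": "i", "PAST": "iess"}},
}

def realize(concept, form=None):
    """Return the Korean surface string for an interlingua concept."""
    entry = lexicon[concept]
    if form and "forms" in entry:
        return entry["forms"][form]   # inflected form, e.g. past tense
    return entry["surface"]
```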
2.2.2 Message Templates

Message templates are target language grammar rules corresponding to the input sentence expressions represented in the semantic frame. For instance, the word order constraint of the target language is specified in this module. A set of message templates used to produce the Korean translation from the semantic frame in Figure 4 is given in Table 2.

5 A pilot study of applying GENESIS to Korean language generation can be found in (Yang, 1995).
[Figure: full parse of "0819 z uss sterett taken under fire by kirov with ssn-12's", with pre-adjunct, subject, and predicate constituents built over semantic categories such as gmt_time, ship_name, ship_mod, v_by, v_with, and submarine_name]

Figure 3: Parse Tree
{c statement
   :time_expression {p numeric_time
      :topic {q gmt
         :pred {p numeric
            :topic 819 } }
   :topic {q ship_name
      :name "sterett"
      :pred {p ship_mod
         :topic "uss" } }
   :subject 1
   :pred {p taken_under_fire
      :mode "pp"
      :pred {p v_by
         :topic {q ship_name
            :name "kirov" } }
      :pred {p v_with
         :topic {q submarine_name
            :name "ssn-12s" } } } }

0819 z uss sterett taken under fire by kirov with ssn-12s
PARAPHRASE: 0819 z USS Sterett taken under fire by Kirov with SSN-12s
TRANSLATION: [Korean output displayed in hangul]

Figure 4: Semantic Frame and Translation
Table 2: Sample Message Templates

(a) statement        :topic :predicate ta
(b) topic            :quantifier :noun_phrase
(c) predicate        :topic :predicate
(d) np-ship_mod      :topic :noun_phrase
(e) ship_mod         :topic
(f) np-missile_name  :topic :noun_phrase
(g) missile_name     :topic
(h) np-by            :topic ey uyhay :noun_phrase
(i) np-with          :topic lo :noun_phrase
Template (a) says that a statement consists of a subject followed by a verb phrase. Note that all of the entries in a message template are optional, so that a statement need not contain a subject or a verb phrase. Template (c) says that a verb phrase consists of an object followed by a verb. Other templates can be interpreted in a similar manner.
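The template mechanism can be illustrated with a toy generator. This is our own sketch; the slot syntax and the romanized particle are assumptions, not GENESIS's actual template format:

```python
# Toy template-driven generator; templates and frame keys are illustrative.
TEMPLATES = {
    "statement": [":topic", ":predicate", "ta"],  # subject, verb phrase, ending
    "np-by":     [":topic", "ey uyhay"],          # "by" postposition (assumed romanization)
}

def generate(frame, name="statement"):
    """Fill a message template from a frame; every :slot is optional."""
    out = []
    for slot in TEMPLATES[name]:
        if slot.startswith(":"):
            value = frame.get(slot[1:])
            if value is not None:                 # optional entry: skip if absent
                out.append(value)
        else:
            out.append(slot)                      # literal morpheme in the template
    return " ".join(out)
```

Because every slot is optional, a frame lacking a subject still yields well-formed output, mirroring the telegraphic fragments in the domain.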
2.2.3 Rewrite Rules

The rewrite rules are intended to capture surface phonological constraints and contractions, in particular, the conditions under which a single morpheme has different phonological realizations. In English, the rewrite rules are used to generate the proper form of the indefinite article a or an. This module plays an important role in Korean due to numerous instances of phonologically conditioned allomorphs in the language. For instance, the so-called nominative case marker is realized as i when the preceding morpheme ends with a consonant, and as ka when the preceding morpheme ends with a vowel, as illustrated below. Similarly, the so-called accusative case marker is realized as ul after a consonant, and as lul after a vowel.
                     Nominative   Accusative
Preceding consonant  John-i       John-ul
Preceding vowel      Mary-ka      Mary-lul
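A minimal sketch of such a rewrite rule, operating on romanized stems, is shown below. This is our own illustration; the actual GENESIS rewrite-rule format is not shown here, and a real system would inspect hangul jamo rather than romanized letters:

```python
# Sketch of phonologically conditioned case-marker selection in Korean.
# Assumes romanized stems whose final letter reveals the final sound.
VOWELS = set("aeiou")

def nominative(stem):
    """Attach -i after a consonant-final stem, -ka after a vowel-final stem."""
    return stem + ("-ka" if stem[-1] in VOWELS else "-i")

def accusative(stem):
    """Attach -ul after a consonant-final stem, -lul after a vowel-final stem."""
    return stem + ("-lul" if stem[-1] in VOWELS else "-ul")
```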
3 Data Summary

Our source language text is called MUC-II data, and consists of naval operational report messages.6 There are 105 messages for system development and 40 messages set aside for system evaluation. These messages feature incidents involving different platforms such as aircraft, surface ships, submarines, and land targets. There are 12 words/sentence (average) and 3 sentences/message (average) (Sundheim, 1989). The original messages are highly telegraphic, with many instances of sentence fragments, as illustrated in (1).

(1) At 1609 hostile forces launched massive recon effort from captured airfield against ctf 177 units transiting toward a neutral nation. Humint sources indicated 12/3 strike acft have launched (1935z) enroute battle force. Have positive confirmation that battle force is targeted (2035z). Considered hostile act.

Table 3: Training Data (Data for System Development)
[Per-set sentence counts illegible in the source; totals: 303 original sentences, 337 modified sentences.]

6 MUC-II stands for the Second Message Understanding Conference.
Following is the modified counterpart (2) of the original message (1). Note that the underlined parts in the modified message are missing in the original message.

(2) At 1609 z hostile forces launched a massive recon effort from a captured airfield against ctf 177 units transiting toward a neutral nation. Humint sources indicated 12 strike aircraft have launched (1935z) enroute to the battle force. CTF 177 has positive confirmation that the battle force is targeted (2035z). This is considered a hostile act.

MUC-II data also had other typical features of raw text. There are quite a few instances of complex sentences, as illustrated in (3) through (5).

(3) Two uss america based strike escort f-14s (fast eagle flight) were engaged during strike against xxx guerilla camp.

(4) Kirov locked on with fire control radar and fired torpedo at spruance.

(5) ... deliberate harassment of uscgc spencer by hostile kirov endangers an already fragile political/military balance between ...

We have trained the system on both the original and the modified MUC-II messages. We discuss the details of the data and the training in the following section.

4 Training and Evaluation

4.1 Training

We first partitioned the MUC-II data corpus into three data sets. In addition, we have a set of 154 MUC-II-like sentences, which were elicited in an in-house experiment. We will discuss these sentences (which comprise data set A*) in Section 4.2. A summary of our training data is given in Table 3, where the numbers given are the numbers of sentences. We have trained the system on all the training data. The analysis rules are developed by hand, based on observed patterns in the data. These rules are then converted into a network structure. Probability assignments on all arcs in the network are obtained automatically by parsing each training sentence and updating the appropriate counts (see (Seneff, 1992) for details). Table 4 gives the statistics of the current system in terms of the size of the analysis lexicon/grammar and the generation lexicon/grammar.7 The translation statistics for the training data are shown in Table 5.

Table 4: Current System Specifications
[Sizes of the analysis and generation lexicons and grammars; the analysis lexicon contains 1427 words; the remaining figures are illegible in the source.]

Table 5: Translation Statistics for Training Data
[# of Sentences; # of Correctly Translated Sentences; counts illegible in the source.]

4.2 Evaluation

We have carried out two types of system evaluation. The first evaluation specifically tests grammar coverage, and the second evaluates overall system performance.

4.2.1 Evaluation of Grammar Coverage

To evaluate grammar coverage, we used data set A*, which contains no unknown words and was collected in an in-house experiment.8 In the experiment, we asked the subjects to study a list of data set A MUC-II sentences and then create about 10 MUC-II-like sentences on their own. Subjects were told to create sentences which illustrate the general style of the MUC-II sentences and which only use the vocabulary items occurring in the example sentences. We collected 154 MUC-II-like sentences in this experiment, and evaluated the system's performance on them. The results are given in Table 6.

4.2.2 Overall Evaluation

The results of system evaluation on the test data (consisting of 40 messages), in sentences, are shown in Table 7. The first point to note is that the system parsed 34.8% of the test sentences which have no unknown words. This figure is somewhat lower than the corresponding figure for data set A* (50.6%) because the test set is harder than data set A*: the test sentences were taken from raw data, whereas the A* sentences were fabricated by studying data set A sentences. The second point to note is that system failure is due in large part to the presence of unknown words; as Table 7 indicates, the system cannot handle unknown words at this time. We discuss our ongoing efforts to tackle this problem in Section 5.

4.3 Efficiency

Largely due to our effort to reduce the ambiguity of the input sentences, the system runs efficiently.

7 Table 4 gives the number of categories in the analysis grammar. The actual number of rules is much greater, because TINA allows cross-pollination, the merging of common elements on the right-hand side of rules. See (Seneff, 1992) for more details about cross-pollination.

8 As we will discuss, it is difficult to use the MUC-II database itself to evaluate grammar coverage, because many MUC-II sentences fail based strictly on the fact that they contain unknown words, i.e., words which are not in the system's lexicon.
Table 6: Data Set A* Evaluation Results

# of Correctly Translated Sentences: (91.0%)

Table 7: Test Evaluation Results

# of Sentences with No Unknown Words
# of Parsed Sentences
# of Correctly Translated Sentences
[counts illegible in the source]
It takes an average of 2.28 seconds to translate a sentence containing 16 words; for a fairly complex sentence containing 38 words, it takes about 4 seconds. Some examples of Korean translation output are given in Figure 5. The system runs on a SPARC 10 workstation, and the source code is written in C. Korean translation outputs are displayed on a hangul window running on UNIX.
5 Toward Robust Translation

At the moment, our system is not capable of dealing with a sentence containing (i) unknown words, cf. Section 4.2, and (ii) unknown constructions, cf. Section 4.1. In this section we discuss our ongoing efforts to overcome these deficiencies: integration of a part-of-speech tagger to handle unknown words/constructions, and a word-for-word translator to cope with other system failures, cf. (Frederking and Nirenburg, 1995).
5.1 Integration of Part-of-Speech Tagger

Regarding the unknown word problem, an obvious solution is to expand the lexicon. Concerning the problem involving unknown constructions, we could easily generalize the grammar to extend its coverage. However, both of these solutions are problematic. Handling the unknown word problem by increasing the size of the lexicon is not that straightforward, given that most unknown words are open-class items such as nouns, verbs, adjectives, and adverbs. In addition, one cannot generalize the grammar without side effects. Due to the highly telegraphic nature of the MUC-II data, generalizing the grammar will greatly increase the ambiguity of an input sentence, cf. (Grishman, 1989).9 Hence, we need alternative solutions to deal with unknown words and unknown constructions. The most desirable solution is to (i) leave the current grammar intact, since it efficiently parses even highly telegraphic messages, and (ii) tackle unknown words and unknown constructions by the same mechanism.

A potential solution to the unknown word problem is to do part-of-speech tagging, replace unknown words with their parts of speech, and bootstrap the parts of speech (instead of the actual words) to the analysis grammar. The unknown words would be replaced in the sentence string with their corresponding part-of-speech tag, and the semantic grammar would be augmented to handle generic adjectives, nouns, verbs, etc., intermixed in the rules at appropriate positions. The idea would be to include just enough semantic information to solve the ambiguity problem, effectively anchoring on words such as ship-name that have high semantic relevance within the domain.

This approach might also be effective as a backoff mechanism when the system fails to parse a sentence containing only known words. A set of semantically significant vocabulary items could be tagged as "immutable", and all the words in the sentence except these anchor words would be converted

9 Recall that we resolve the ambiguity problem by constraining the grammar with semantic categories.
Table 8: Tagger Evaluation on TEST Data

                    Overall Accuracy     Unknown Word Accuracy
Before Training     (87%)                [illegible]
After Training I    1249/1287 (97%)      70/82
After Training II   (98%)                [illegible]
to part-of-speech prior to a second attempt to parse. The same grammar would be used in all cases.
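The proposed backoff might be sketched as follows. This is illustrative only; the anchor list and the toy tagger are placeholders for the system's actual resources (e.g., the Brill tagger):

```python
# Sketch of the part-of-speech backoff described above.
# ANCHORS and pos_tag are placeholders, not the system's actual resources.
ANCHORS = {"sterett", "kirov", "uss"}    # semantically "immutable" anchor words

def pos_tag(word):
    """Toy stand-in for a rule-based tagger such as Brill's."""
    return "V" if word.endswith("ed") else "N"

def backoff_string(sentence):
    """Replace every non-anchor word with its part-of-speech tag,
    yielding a string the augmented grammar can attempt to parse."""
    out = []
    for word in sentence.split():
        if word in ANCHORS:
            out.append(word)             # keep high-relevance anchors verbatim
        else:
            out.append(pos_tag(word))    # generic category instead of the word
    return " ".join(out)
```

An unknown word and an unknown construction are then handled identically: both reduce to a sequence of generic category symbols around the semantic anchors.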
For the solution sketched above, we have evaluated the Rule-Based Part-of-Speech Tagger (Brill, 1992) on the test data both before and after training on the MUC-II database. These results are given in Table 8. Tagging statistics 'before training' are based on the lexicon and rules acquired from the BROWN CORPUS and the WALL STREET JOURNAL CORPUS. Tagging statistics 'after training' are divided into two categories, both of which are based on the rules acquired from training on data sets A, B, and C of the MUC-II database. The only difference between the two is that in one case (After Training I) we use a lexicon acquired from the MUC-II database, and in the other case (After Training II) we use a lexicon acquired from a combination of the BROWN CORPUS, the WALL STREET JOURNAL CORPUS, and the MUC-II database. Since the tagging result is quite promising, despite the fact that the training data is of modest size, we are planning to integrate the tagger into the analysis module.
5.2 Integration of Word-for-Word Translator

Even though implementing the part-of-speech tagger and extending the analysis grammar to accept parts of speech as terminal strings will increase the grammar coverage, it is an almost impossible task to write a grammar which covers all freely occurring natural language texts, let alone to have a robust parser to deal with this inadequacy.10 Despite this difficulty in designing a complete translation system, an ideal translation system ought to be able to produce translations which are useful under any circumstances. Therefore, we are integrating a word-for-word translator,11 which provides tools to aid a human translator, as a fallback system.

Figure 6 shows the planned robust system architecture, with the part-of-speech tagger and the word-for-word translator integrated into the core understanding/generation system. Note that the system will provide an indication or flag to the user showing whether the translation is produced by TINA/GENESIS or by the word-for-word fallback system.
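The planned control flow amounts to a simple cascade, sketched below. Function names and the glossary are our own placeholders, not the actual system components:

```python
# Illustrative fallback cascade for robust translation.
# full_translate and the glossary are stand-ins for the real components.
def full_translate(sentence):
    """Placeholder for the TINA/GENESIS pipeline; None signals a failure."""
    return None                                  # pretend the parse failed

def word_for_word(sentence, glossary):
    """Gloss each word independently, keeping unglossed words as-is."""
    return " ".join(glossary.get(w, w) for w in sentence.split())

def robust_translate(sentence, glossary):
    """Try full translation first; fall back to word-for-word, with a flag."""
    result = full_translate(sentence)
    if result is not None:
        return result, "TINA/GENESIS"            # flag the high-quality path
    return word_for_word(sentence, glossary), "word-for-word fallback"
```

The returned flag corresponds to the user-visible indication of which path produced the output.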
6 Summary

In this paper we have described our ongoing work in automatic English-to-Korean text translation of telegraphic messages. This is a part of our overall effort in text and speech translation for limited-domain multilingual applications. We have described the system architecture (Section 2), the source language text (Section 3), and the system evaluation results (Section 4). We have also discussed ideas on how to make the system robust, and proposed two specific solutions: integration of a part-of-speech tagger and a word-for-word translator (Section 5).
7 Acknowledgements

We would like to acknowledge the following people for their contributions to this project: Jack Lynch, Beth Carlson, Victor Zue, and SungSim Park. We also would like to thank Beth Sundheim of NRaD, Professor Ralph Grishman

10 See (Sleator, 1991) for a design of a robust parser which handles unknown constructions.

11 The word-for-word translator is being developed by GARJAK under a subcontract.
at 1609 z hostile forces launched massive recon effort from captured airfield against ctf 177 units transiting toward a neutral nation
PARAPHRASE: 1609 z hostile forces launched massive reconnaissance effort from captured airfield against ctf 177 units transiting toward a neutral nation
TRANSLATION: [Korean output in hangul]

lost hostile acft in vicinity
PARAPHRASE: lost hostile aircraft in vicinity
TRANSLATION: [Korean output in hangul]

two uss america based strike escort f dash 14s were engaged by unknown number of hostile su dash 7 aircraft near land g bay ( poren island 2 target facility ) while conducting strike against xxx guerilla camp
PARAPHRASE: 2 USS America-based strike escort F-14s were engaged while conducting strike against xxx guerilla camp by hostile Su-7 aircraft unknown number of near Land G Bay ( Poren Island 2 target facility )
TRANSLATION: [Korean output in hangul]

Figure 5: Sample Korean Translation Output
Figure 6: Robust Translation System 
of NYU, Professor Key-Sun Choi of KAIST, Korea, and Willis Kim of MITRE. Beth Sundheim provided us with the MUC-II data as well as technical reports documenting the data. Dr. Grishman provided us with his grammar, dictionary, and semantic models for the MUC-II data. These materials helped us understand the linguistic properties of the MUC-II corpus. Dr. Choi provided us with documentation and software for KAIST's MATES/EK English-to-Korean machine translation system. Kim provided us with electronic English/Korean dictionaries as well as a report on his work.
References

Eric Brill. 1992. A Simple Rule-Based Part of Speech Tagger. Proceedings of the Third Conference on Applied Natural Language Processing. Trento, Italy: ACL.

Key-Sun Choi, Seungmi Lee, Hiongun Kim, Deok-Bong Kim, Cheoljung Kweon, and Gilchang Kim. 1994. An English-to-Korean Machine Translator: MATES/EK. Proceedings of the 15th International Conference on Computational Linguistics. Kyoto, Japan.

Robert Frederking and Sergei Nirenburg. 1995. Three Heads are Better than One. C-STAR II Meeting, Pittsburgh.

James Glass, Joseph Polifroni and Stephanie Seneff. 1994. Multilingual Language Generation Across Multiple Domains. Proceedings of the 1994 International Conference on Spoken Language Processing. Yokohama, Japan.

Ralph Grishman. 1989. Analyzing Telegraphic Messages. Proceedings of the 1989 DARPA Speech and Natural Language Workshop. Cape Cod, Massachusetts.

W. Kim and W. Rhee. 1994. Machine Translation Evaluation. Seoul, Korea: MITRE.

Youngjik Lee, Young-Sum Kim, Jung-Chul Lee, Joon-Hyung Ryoo and Jae-Woo Yang. 1995. Korean-Japanese Speech Translation System for Hotel Reservation - Korean Front Desk Side. European Conference on Speech Communication and Technology. Madrid.

Stephanie Seneff. 1992. TINA: A Natural Language System for Spoken Language Applications. Computational Linguistics, 18(1): 61-88.

Daniel D.K. Sleator and Davy Temperley. 1991. Parsing English with a Link Grammar. CMU-CS-91-196.

Beth M. Sundheim. 1989. Navy Tactical Incident Reporting in a Highly Constrained Sublanguage: Examples and Analysis. Technical Document 1477. Naval Ocean Systems Center, San Diego.

Dinesh Tummala, Stephanie Seneff, Douglas Paul, Clifford Weinstein, and Dennis Yang. 1995. CCLINC: System Architecture and Concept Demonstration of Speech-to-Speech Translation for Limited-Domain Multilingual Applications. Proceedings of the 1995 ARPA Spoken Language Technology Workshop. Austin, Texas.

Dennis W. Yang. 1996. Korean Language Generation in an Interlingua-based Speech Translation System. Technical Report 1026. MIT Lincoln Laboratory.
