<?xml version="1.0" standalone="yes"?> <Paper uid="C96-2119"> <Title>Automatic English-to-Korean Text Translation of Telegraphic Messages in a Limited Domain</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The overall goal of our translation work is automatic text and speech translation for limited-domain multilingual applications. The primary target application is enhanced communication among military forces in a multilingual coalition environment, where translation utilizes a Common Coalition Language as a military interlingua. Our development effort was initiated with a speech-to-speech translation system, called CCLINC (Tummala et al., 1995), which consists of a modular, multilingual structure including speech recognition, language understanding, language generation, and speech synthesis in each language. The system architecture of CCLINC is given in Figure 1. Note that the system design provides for verification of the system's understanding of each utterance to the originator, in a paraphrase in the originator's language, before transmission on the coalition network.</Paragraph> <Paragraph position="1"> This paper describes our current work in automatic English-to-Korean text translation of telegraphic military messages,2 which is an initial step toward the ultimate goal1. 1 This work was sponsored by the Defense Advanced Research Projects Agency. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Air Force.</Paragraph> <Paragraph position="2"> 2 We are also working on Korean-to-English text translation on the same domain, which we do not include in this paper.</Paragraph> <Paragraph position="3"> of producing high quality text/speech translation output.3 The core of our text translation system consists of an analysis module and a generation module.
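At a toy scale, the two-module design just described can be sketched as follows. All function names, the frame layout, and the romanized Korean forms are illustrative assumptions for this sketch, not CCLINC's actual code or lexicon.

```python
# Toy sketch of the two-module pipeline described above: an analysis
# step maps an English sentence onto an interlingua "semantic frame",
# and a generation step realizes that frame in Korean (SOV) word order.

def analyze(sentence):
    """Map a subject-verb-object English sentence to a semantic frame."""
    subject, verb, obj = sentence.lower().split()
    return {"clause": "statement",
            "topic": subject,
            "predicate": {"verb": verb, "object": obj}}

def generate(frame, lexicon):
    """Realize a semantic frame in the target language, object before verb."""
    look = lambda w: lexicon.get(w, w)   # fall back to the source word
    pred = frame["predicate"]
    return " ".join([look(frame["topic"]),
                     look(pred["object"]),
                     look(pred["verb"])])

def translate(sentence, lexicon):
    return generate(analyze(sentence), lexicon)

# Hypothetical romanized Korean surface forms, for illustration only.
LEXICON = {"fired": "palsahayssta", "torpedo": "elwoy"}
print(translate("kirov fired torpedo", LEXICON))  # -> kirov elwoy palsahayssta
```

Because the frame is language-neutral, the same analysis output could in principle drive generators for several target languages, which is the point of the interlingua design.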
The analysis module produces a semantic frame, which is an interlingua representation of the input sentence. The intractable ambiguities of natural language are overcome by restricting the domain and the grammar rules, which specify the semantic co-occurrence restrictions of head categories. The structural difference between the source (English) and the target (Korean) language is easily captured by the flexible interlingua representation and the strictly modularized target language grammar template, external to the core generation system. The simplicity of the system enables us to detect problems and provide solutions easily. Currently the system has a vocabulary of 1427 words. The system runs on a SPARC 10 workstation. The Korean translation outputs are displayed on a hangul window running on UNIX. In addition, we are in the process of porting the system to a Pentium laptop running on Linux.</Paragraph> <Paragraph position="4"> This paper is organized as follows: In Section 2 we describe our system architecture, along with the grammar rules which drive the core system. In Section 3 we summarize the characteristics of our source language text, comprised of naval operational messages. In Section 4 we give our system evaluation. In Section 5 we discuss the integration of two subsystems for system robustness: a rule-based part-of-speech tagger to handle unknown words/constructions, and a word-for-word translator to produce partial translations in the event of system failure. Finally, we summarize the paper in Section 6.</Paragraph> </Section> <Section position="5" start_page="0" end_page="705" type="metho"> <SectionTitle> 2 System Description </SectionTitle> <Paragraph position="0"> The core of our translation system consists of two modules: the understanding/analysis module, TINA, and the generation module, GENESIS.4 These modules are driven by a set of files which specify the source and target language grammars.
The process flow of our text translation system is given in Figure 2.</Paragraph> <Section position="1" start_page="0" end_page="705" type="sub_section"> <SectionTitle> 2.1 Language Understanding </SectionTitle> <Paragraph position="0"> The language understanding system, TINA, described at length in (Seneff, 1992), integrates key ideas from context-free grammar, augmented transition network, and unification concepts. With the context-free grammar rules of English as input, the system produces the parse tree of an input sentence. The parse tree is then mapped onto a semantic frame, which plays the role of an interlingua. The parse tree and the semantic frame of the input sentence &quot;0819 z uss sterett taken under fire by kirov with ssn-12's&quot; are given in Figure 3 and Figure 4, respectively. 3 Refer to (Kim, 1994) for other ongoing efforts in English/Korean text translation including (Choi, 1994). See (Lee, 1995) for speech translation work with Korean as the source language.</Paragraph> <Paragraph position="1"> 4 Both modules are developed under ARPA sponsorship by the Spoken Language Systems Group at the MIT Laboratory for Computer Science.</Paragraph> <Paragraph position="3"/> </Section> </Section> <Section position="6" start_page="705" end_page="706" type="metho"> <SectionTitle> TO/FROM KOREAN/CCL AND FRENCH/CCL TRANSLATION SYSTEMS </SectionTitle> <Paragraph position="0"></Paragraph> <Paragraph position="1"> As is reflected in the parse tree, both syntactic and semantic categories are utilized in our grammar specification. Top-level categories such as 'sentence,' 'subject,' etc. are syntax-based, whereas lower-level categories such as 'ship_name,' 'time_expression,' etc. are semantics-based.
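The mixing of syntactic and semantic categories can be sketched with a toy rule set. This is a minimal sketch, not TINA's actual rule format; the lexicon entries and rule names are assumptions.

```python
# Toy sketch of a grammar that mixes syntactic categories with
# domain-semantic preterminals: "uss" may only modify a ship name,
# so "uss sterett" is accepted while a weapon noun in that slot is not.

LEXICON = {
    "uss": "ship_mod",
    "sterett": "ship_name",
    "kirov": "ship_name",
    "torpedo": "weapon",
}

# ship -> ship_mod ship_name | ship_name
SHIP_RULES = [["ship_mod", "ship_name"], ["ship_name"]]

def is_ship_subject(tokens):
    """Check whether the token sequence satisfies a 'ship' rule."""
    categories = [LEXICON.get(tok) for tok in tokens]
    return categories in SHIP_RULES

print(is_ship_subject(["uss", "sterett"]))   # -> True
print(is_ship_subject(["uss", "torpedo"]))   # -> False
```

Restricting which preterminals may co-occur in a rule is what prunes spurious parses: a word sequence only parses if its semantic categories line up with a rule.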
The main advantage of adopting semantic categories is that we can easily specify the co-occurrence restrictions of head categories (e.g., the parse tree specifies that the category ships occurs with a small subset of nominal modifiers including uss, which we call 'ship_mod'), and therefore reduce the ambiguity of the input sentence. In addition, it provides for easy access to the meaning of domain-specific expressions. The parse tree directly encodes the knowledge that sterett and kirov are ship names, ssn-12 a submarine name, and z stands for Greenwich Mean Time.</Paragraph> <Paragraph position="2"> As for mapping from parse tree to semantic frame, we reduce all the major parse tree constituents into one of three syntactic roles, i.e., clause, topic, and predicate. All clause-level categories, including statements, infinitives, etc., are mapped onto clause. All noun phrases are mapped onto topic, and all modifiers as well as verb phrases are mapped onto predicate. However, there is no limit to the number of semantic frame categories, and we can easily create new categories for a more elaborate representation. In Figure 4, we have additional categories like 'time_expression.' Whether or not we add more categories to the semantic frame depends on how elaborate a translation output is desired. If elaborate translations are required, we increase the number of semantic frame categories. The flexibility of the semantic frame representation makes the TINA system an ideal tool for machine translation for various (i) purposes (i.e., whether a detailed or rough translation is required), and (ii) languages (e.g., some languages require a more elaborate tense representation or honorification than others, and the appropriate categories can be easily added).</Paragraph> <Section position="1" start_page="705" end_page="706" type="sub_section"> <SectionTitle> 2.2 Language Generation </SectionTitle> <Paragraph position="0"> The language generation system, GENESIS (Glass, Polifroni and Seneff, 1994), produces target language output on the basis of the semantic frame representation. It is driven by three submodules: a lexicon, a set of message templates, and a set of rewrite rules. These modules are language-specific and external to the core generation system. Consequently, porting the generation system to a new language is confined to developing these submodules.5 Since the semantic frame uses English as its specification language, and is the basis for constructing the target language grammar and lexicon, entries in the lexicon contain words and concepts found in the semantic frame, expressed in English, with corresponding surface realization forms in Korean. A sample fragment of a bilingual lexicon is given in Table 1: be V (ROOT i, PRES i, PAST iess); V2 V (ING goiss, PP ess, PRES n, PAST ess); cause_en V2 cholaytoy; visually AV sikakulo; cap_aircraft N centhwu cengchalki. Message templates are target language grammar rules corresponding to the input sentence expressions represented in the semantic frame. For instance, the word order constraint of the target language is specified in this module. A set of message templates used to produce the Korean translation from the semantic frame in Figure 4 is given in Table 2.</Paragraph> <Paragraph position="1"> Template (b) says that a statement consists of a subject followed by a verb phrase. Note that all of the entries in a message template are optional, so that a statement need not contain a subject or a verb phrase.
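Such templates can be rendered at toy scale as ordered slot lists. The slot names and the romanized Korean lexicon entries below are assumptions for illustration, not the actual GENESIS template format.

```python
# Toy rendering of message templates: each template lists ordered
# slots, every slot is optional, and nested frames are realized
# recursively, yielding Korean subject-object-verb order.

TEMPLATES = {
    "statement": ["topic", "predicate"],   # subject then verb phrase
    "predicate": ["object", "verb"],       # object precedes the verb
}

# Hypothetical romanized Korean surface forms.
LEXICON = {"sterett": "sutheleythu", "torpedo": "elwoy", "fired": "palsahayssta"}

def realize(template_name, frame):
    parts = []
    for slot in TEMPLATES[template_name]:
        value = frame.get(slot)
        if value is None:                  # every template entry is optional
            continue
        if isinstance(value, dict):        # embedded constituent: recurse
            parts.append(realize(slot, value))
        else:
            parts.append(LEXICON.get(value, value))
    return " ".join(parts)

frame = {"topic": "sterett",
         "predicate": {"object": "torpedo", "verb": "fired"}}
print(realize("statement", frame))   # -> sutheleythu elwoy palsahayssta
```

Because missing slots are simply skipped, the same templates handle telegraphic fragments that lack a subject or a verb phrase.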
Template (c) says that a verb phrase consists of an object followed by a verb. Other templates can be interpreted in a similar manner.</Paragraph> <Paragraph position="2"> The rewrite rules are intended to capture surface phonological constraints and contractions, in particular the conditions under which a single morpheme has different phonological realizations. In English, the rewrite rules are used to generate the proper form of the indefinite article a or an. This module plays an important role in Korean due to numerous instances of phonologically conditioned allomorphs in the language. For instance, the so-called nominative case marker is realized as i when the preceding morpheme ends with a consonant, and as ka when the preceding morpheme ends with a vowel, as illustrated below. Similarly, the so-called accusative case marker is realized as ul after a consonant, and as lul after a vowel.</Paragraph> <Paragraph position="3"> After a consonant: nominative John-i, accusative John-ul.</Paragraph> <Paragraph position="5"/> </Section> </Section> <Section position="7" start_page="706" end_page="708" type="metho"> <SectionTitle> 3 Data Summary </SectionTitle> <Paragraph position="0"> Our source language text is called MUC-II data, and consists of naval operational report messages.6 There are 105 messages for system development and 40 messages set aside for system evaluation. These messages feature incidents involving different platforms such as aircraft, surface ships, submarines and land targets. 6 MUC-II stands for the Second Message Understanding Conference.</Paragraph> <Paragraph position="2"> There are 12 words/sentence (average) and 3 sentences/message (average) (Sundheim, 1989).</Paragraph> <Paragraph position="3"></Paragraph> <Paragraph position="4"> The original messages are highly telegraphic, with many instances of sentence fragments, as illustrated in (1).</Paragraph> <Paragraph position="5"> (1) At 1609 hostile forces launched massive recon effort from captured airfield against ctf 177 units transiting toward a neutral nation. Humint sources indicated 12/3 strike acft have launched (1935z) enroute battle force. Have positive confirmation that battle force is targeted (2035z). Considered hostile act. The following modified message (2) corresponds to the original message (1); note that the underlined parts in the modified message are missing from the original message. (2) Hostile forces launched a massive recon effort from a captured airfield against ctf 177 units transiting toward a neutral nation. Humint sources indicated 12 strike aircraft have launched (1935z) enroute to the battle force. CTF 177 has positive confirmation that the battle force is targeted (2035z). This is considered a hostile act. The MUC-II data also had other typical features of raw text. There were quite a few instances of complex sentences, as in (3) through (5). (3) ... were engaged ... (stand-in for a friendly) during strike against xxx guerrilla camp. (4) Kirov locked on with fire control radar and fired torpedo at spencer. (5) Deliberate harassment of uscgc spencer by hostile kirov ... endangers an already fragile political/military balance between ...</Paragraph> <Section position="1" start_page="707" end_page="708" type="sub_section"> <SectionTitle> 4.2 Evaluation </SectionTitle> <Paragraph position="0"> 4 Training and Evaluation. 4.1 Training. We first partitioned the MUC-II data corpus into three data sets, A, B and C. In addition, a set of 154 MUC-II-like sentences (which comprise data set A*) was collected in an in-house experiment; we discuss these sentences below. We have trained the system on all the training data. A summary of our training data is given in Table 3. The analysis rules are developed by hand, based on observed patterns in the data. These rules are then converted into a network structure. Probability assignments on all arcs in the network are obtained automatically by parsing each training sentence and updating appropriate counts (see (Seneff, 1992) for details). Table 4 gives the statistics of the current system in terms of the size of the analysis lexicon/grammar and generation lexicon/grammar.7 The translation statistics for the training data are shown in Table 5. 7 Table 4 gives the number of categories in the analysis grammar; the actual number of rules is much greater, because TINA allows cross-pollination of common elements on the right-hand side of rules (see (Seneff, 1992) for more details about cross-pollination). 4.2 Evaluation. We have carried out two types of system evaluations: the first specifically tests grammar coverage, and the second evaluates overall system performance. To test grammar coverage, we used the MUC-II-like sentences of data set A*, which contain no unknown words.8 In the experiment, we asked the subjects to study a list of data set A MUC-II sentences and then create about 10 MUC-II-like sentences on their own. Subjects were told to create sentences which illustrate the general style of the MUC-II sentences and which only use the vocabulary items occurring in the example sentences. We collected 154 MUC-II-like sentences in this experiment, and then evaluated the system's performance on these sentences. 8 As we discuss below, it is difficult to use the MUC-II database itself to evaluate grammar coverage, because many MUC-II sentences fail based strictly on the fact that they contain unknown words, i.e., words which are not in the system's lexicon. The results of system evaluation on the test data (consisting of 40 messages) are shown in Table 6. The first point to note is that the standard system parsed 34.8% of the test sentences which have no unknown words. This figure is somewhat lower than the corresponding figure for data set A* (50.6%) because the test set is harder: the A* sentences were fabricated by studying A sentences. The second point to note is that system failure is due in large part to the presence of unknown words, which the system cannot handle at this time. We discuss our ongoing efforts to tackle this problem in Section 5. 4.3 Efficiency. Largely due to our effort to reduce the ambiguity of the input sentences, the system runs efficiently. It takes an average of 2.28 seconds to translate a sentence containing 16 words. For a fairly complex sentence containing 38 words, it takes about 4 seconds to translate.</Paragraph> <Paragraph position="5"> Some examples of Korean translation output are also given. The source code is written in C. Korean translation outputs are displayed on a hangul window running on UNIX.</Paragraph> </Section> </Section> <Section position="8" start_page="708" end_page="708" type="metho"> <SectionTitle> 5 Toward Robust Translation </SectionTitle> <Paragraph position="0"> At the moment, our system is not capable of dealing with a sentence containing (i) unknown words, cf.
Section 4.2, and (ii) unknown constructions, cf. Section 4.1. In this section we discuss our ongoing efforts to overcome these deficiencies: integration of a part-of-speech tagger to handle unknown words/constructions, and a word-for-word translator to cope with other system failures, cf. (Frederking and Nirenburg, 1995).</Paragraph> <Section position="1" start_page="708" end_page="708" type="sub_section"> <SectionTitle> 5.1 Integration of Part-of-Speech Tagger </SectionTitle> <Paragraph position="0"> Regarding the unknown word problem, an obvious solution is to expand the lexicon. Concerning the problem involving unknown constructions, we could easily generalize the grammar to extend its coverage. However, both of these solutions are problematic. Handling the unknown word problem by increasing the size of the lexicon is not that straightforward, given that most unknown words are open class items such as nouns, verbs, adjectives and adverbs. In addition, one cannot generalize the grammar without side effects. Due to the highly telegraphic nature of the MUC-II data, generalizing the grammar will increase the ambiguity of an input sentence greatly, cf. (Grishman, 1989).9 Hence, we need alternative solutions to deal with unknown words and unknown constructions. The most desirable solution is to (i) leave the current grammar intact, since it efficiently parses even highly telegraphic messages, and (ii) tackle unknown words and unknown constructions by the same mechanism.</Paragraph> <Paragraph position="1"> A potential solution to the unknown word problem is to do part-of-speech tagging, replace unknown words with their parts-of-speech, and bootstrap the parts-of-speech (instead of the actual words) to the analysis grammar.
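This substitution step can be sketched as follows. The tiny dictionary tagger and lexicon below are stand-ins for illustration; the real proposal uses a trained tagger and the system's actual lexicon.

```python
# Sketch of the proposed backoff: substitute a generic part-of-speech
# token for each word missing from the system lexicon, then hand the
# result to the analysis grammar (augmented with generic N/V/ADV
# preterminals). A dictionary stands in for a real tagger here.

SYSTEM_LEXICON = {"kirov", "fired", "torpedo", "at", "spencer"}
TAGGER = {"harassed": "V", "cutter": "N", "deliberately": "ADV"}  # stand-in

def with_pos_backoff(sentence):
    """Replace unknown words with their part-of-speech tags."""
    out = []
    for word in sentence.lower().split():
        if word in SYSTEM_LEXICON:
            out.append(word)                   # known: keep the word itself
        else:
            out.append(TAGGER.get(word, "N"))  # unknown: use its POS tag
    return out

print(with_pos_backoff("kirov deliberately harassed spencer"))
# -> ['kirov', 'ADV', 'V', 'spencer']
```

The grammar then only needs a handful of generic preterminal rules to parse sentences containing words it has never seen.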
The unknown words would be replaced in the sentence string with their corresponding part-of-speech tag, and the semantic grammar would be augmented to handle generic adjectives, nouns, verbs, etc., intermixed in the rules at appropriate positions. The idea would be to include just enough semantic information to solve the ambiguity problem, effectively anchoring on words such as ship-name that have high semantic relevance within the domain.</Paragraph> <Paragraph position="2"> This approach might also be effective as a backoff mechanism when the system fails to parse a sentence containing only known words. A set of semantically significant vocabulary items could be tagged as &quot;immutable&quot;, and all the words in the sentence except these anchor words would be converted to part-of-speech prior to a second attempt to parse. The same grammar would be used in all cases.</Paragraph> <Paragraph position="3"> 9 Recall that we resolve the ambiguity problem by constraining the grammar with semantic categories.</Paragraph> <Paragraph position="4"> For the solution sketched above, we have evaluated the</Paragraph> </Section> <Section position="2" start_page="708" end_page="708" type="sub_section"> <SectionTitle> Rule-Based Part-of-Speech Tagger (Brill, 1992) on the test </SectionTitle> <Paragraph position="0"> data both before and after training on the MUC-II database.</Paragraph> <Paragraph position="1"> These results are given in Table 8. Tagging statistics 'before training' are based on the lexicon and rules acquired from the BROWN CORPUS and the WALL STREET JOURNAL CORPUS. Tagging statistics 'after training' are divided into two categories, both of which are based on the rules acquired from training on data sets A, B, and C of the MUC-II database.
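The second-pass backoff with immutable anchors can be sketched in the same style. The anchor set and tag dictionary are illustrative assumptions.

```python
# Sketch of the second backoff pass described above: semantically
# significant words are marked "immutable"; on a parse failure every
# other word is reduced to its part-of-speech and the same grammar is
# tried again on the resulting token sequence.

ANCHORS = {"kirov", "sterett", "spencer"}          # e.g. ship names
POS = {"locked": "V", "on": "P", "with": "P",
       "fire": "N", "control": "N", "radar": "N"}

def second_pass(tokens):
    """Keep anchor words verbatim; reduce everything else to a POS tag."""
    return [t if t in ANCHORS else POS.get(t, "N") for t in tokens]

print(second_pass(["kirov", "locked", "on", "with", "radar"]))
# -> ['kirov', 'V', 'P', 'P', 'N']
```

Keeping only the high-relevance anchors preserves enough semantic constraint to disambiguate, while the POS tokens absorb whatever vocabulary or construction caused the first parse to fail.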
The only difference between the two is that in one case (After Training I) we use a lexicon acquired from the MUC-II database, and in the other case (After Training II) we use a lexicon acquired from a combination of the BROWN CORPUS, the WALL STREET JOURNAL CORPUS, and the MUC-II database. Since the tagging result is quite promising, despite the fact that the training data is of modest size, we are planning to integrate the tagger into the analysis module.</Paragraph> </Section> <Section position="3" start_page="708" end_page="708" type="sub_section"> <SectionTitle> 5.2 Integration of Word-for-Word Translator </SectionTitle> <Paragraph position="0"> Even though implementing the part-of-speech tagger and extending the analysis grammar to accept parts-of-speech as terminal strings will increase the grammar coverage, it is an almost impossible task to write a grammar which covers all freely occurring natural language texts, let alone have a robust parser to deal with this inadequacy.10 Despite this difficulty in designing a complete translation system, an ideal translation system ought to be able to produce translations which are useful under any circumstances. Therefore, we are integrating a word-for-word translator,11 which provides tools to aid a human translator, as a fallback system.</Paragraph> <Paragraph position="1"> Figure 6 shows the planned robust system architecture, with the part-of-speech tagger and the word-for-word translator integrated into the core understanding/generation system. Note that the system will provide an indication or flag to the user showing whether the translation is produced by TINA/GENESIS or by the word-for-word fallback system.</Paragraph> </Section> </Section> </Paper>