<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-1030">
  <Title>Example-Based Machine Translation in the Pangloss System</Title>
  <Section position="12" start_page="172" end_page="172" type="concl">
    <SectionTitle>
8 Strengths and Weaknesses
</SectionTitle>
    <Paragraph position="0"> As currently implemented, PanEBMT has both strengths and weaknesses. Its strengths are that the minimal knowledge required allows quick retargeting and that its design provides for graceful degradation. Its weaknesses are that it is unable to completely cover inputs, that it does not perform well when the correspondences between source-language and target-language words are not one-to-one, and that (like statistically-based translation systems) it is sensitive to differences between the example corpus and the sentences to be translated. The astute reader will have noticed that there have been virtually no mentions of the source or target languages in this paper; they are not relevant to discussions of the design and operation of the engine, since the only language-dependent knowledge consists of the equivalence classes and the lists of insertable and elidable words, which are provided via the configuration file. This language-independent aspect of EBMT makes PanEBMT rapidly retargetable to other language pairs, and in fact there are already versions of PanEBMT providing Serbocroatian-to-English and English-to-Serbocroatian translations (no experimental data is as yet available for Serbocroatian because the complete dictionary and corpus are still being acquired). Given the three required knowledge sources of corpus, dictionary, and word-root list, PanEBMT can begin producing translations for a new language pair in only a few hours. Fine tuning will require one to two weeks to determine reasonable word classes for tokenization (along with the required re-indexing of the corpus) and to adjust the scoring function weights.</Paragraph>
    <Paragraph position="1"> The number and quality of translations degrade gradually as the size and quality of the bilingual dictionary and synonym list decrease. An incomplete dictionary or root/synonym list merely causes PanEBMT to miss some potential translations. Similarly, a smaller corpus produces fewer potential matches, but there is no point for any of the three knowledge sources at which the engine suddenly ceases to function. One can take advantage of this gradual behavior by building the knowledge sources incrementally and using EBMT for translations even before the knowledge sources have been completed. In particular, by adding post-edited output of the MT system back into the corpus, the system can be bootstrapped from a relatively modest initial corpus (precisely the idea behind a translation memory).</Paragraph>
    <Paragraph position="2"> During preparation of this paper, several extraneous lines were discovered in the corpus files, which caused more than 29,000 sentence pairs (over 4% of the corpus) to be corrupted. Due to the extra lines, the corrupted pairs consisted of the English target sentence from one pair and the Spanish source sentence from the following pair.</Paragraph>
    <Paragraph position="3"> This error had not been discovered earlier because it had no obvious effect on PanEBMT's performance, a clear example of the system's graceful degradation property.</Paragraph>
    <Paragraph position="4"> Lack of complete input coverage is a severe obstacle to using PanEBMT as a stand-alone translation system. The engine cannot generate a chunk for a word unless it co-occurs with either the preceding or the following word somewhere in the corpus and at least one such occurrence can be successfully aligned. Additionally, candidate chunks are omitted if the alignment was successful but the scoring function indicates a poor match. Unless all of these conditions are met, a gap in the output occurs for the particular input word. In the context of the Pangloss system, such gaps are not a problem, since one of the other engines can usually supply a translation covering each gap.</Paragraph>
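    <Paragraph> The conditions above can be illustrated with a minimal sketch. It is not the PanEBMT implementation: the function name find_gaps, the bigram-level view of the corpus, and the can_align, score, and min_score parameters are all hypothetical stand-ins, and the sketch collapses "at least one occurrence can be aligned" into a single yes/no test per word pair.

```python
from typing import Callable, List, Set, Tuple

Bigram = Tuple[str, str]


def find_gaps(
    words: List[str],
    corpus_bigrams: Set[Bigram],
    can_align: Callable[[Bigram], bool],
    score: Callable[[Bigram], float],
    min_score: float,
) -> List[int]:
    """Return indices of input words for which no chunk would be produced.

    All names here are illustrative assumptions, not the engine's own API.
    """
    gaps = []
    for i, word in enumerate(words):
        # Two-word contexts for this word: (previous, word) and (word, next).
        contexts = []
        if i >= 1:
            contexts.append((words[i - 1], word))
        if i + 1 != len(words):
            contexts.append((word, words[i + 1]))
        usable = any(
            ctx in corpus_bigrams            # co-occurs with a neighbor in the corpus
            and can_align(ctx)               # the occurrence can be aligned
            and score(ctx) >= min_score      # the match is not scored as poor
            for ctx in contexts
        )
        if not usable:
            gaps.append(i)
    return gaps


words = "the committee adopted resolution".split()
bigrams = {("the", "committee"), ("committee", "adopted")}
gaps = find_gaps(words, bigrams, lambda b: True, lambda b: 1.0, 0.5)
print(gaps)  # [3]
```

In this toy input only the last word has no neighboring pair in the corpus, so it alone is reported as a gap; in Pangloss such a position would be left for one of the other engines to fill.</Paragraph>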
    <Paragraph position="5"> As currently implemented, the EBMT engine is unable to deal properly with translations that do not involve one-for-one correspondences between source and target words (e.g. Spanish &amp;quot;mil millones&amp;quot; corresponding to English &amp;quot;billions&amp;quot;). Under the current alignment algorithm, the lack of a one-to-one correspondence between source-language and target-language expressions can often cause the alignment to be incorrect or to fail altogether.</Paragraph>
    <Paragraph position="6"> Since the corpus used in the experiments described here was based almost entirely on the UN proceedings rather than newswire text, PanEBMT did not find many long chunks during the evaluation. In fact, the average chunk was just over three words in length, and less than three percent of the chunks were more than six words long.</Paragraph>
    <Paragraph position="7"> This quite naturally affects the quality of the final translation, since many short pieces must be assembled into a translation rather than one or two long segments.</Paragraph>
    <Paragraph position="8"> Despite all these difficulties, PanEBMT was able to cover 70.2% of its input with good chunks, and to generate some translation for more than 84 [...] ordinarily not output at all). Integrating the hand-crafted glossaries from Pangloss into the corpus, thus adding 148,600 effectively pre-aligned phrases, improved the matches against the corpus from 90.4% to 90.9% of the input, and the coverage with good chunks to 73.3%.</Paragraph>
  </Section>
</Paper>