File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/c00-2162_metho.xml
Size: 15,157 bytes
Last Modified: 2025-10-06 14:07:15
<?xml version="1.0" standalone="yes"?> <Paper uid="C00-2162"> <Title>Improving SMT quality with morpho-syntactic analysis</Title>
<Section position="3" start_page="0" end_page="1081" type="metho"> <SectionTitle> 2 Statistical Machine Translation </SectionTitle>
<Paragraph position="0"> The goal of the translation process in statistical machine translation can be formulated as follows: A source language string $f_1^J = f_1 \ldots f_J$ is to be translated into a target language string $e_1^I = e_1 \ldots e_I$. In the experiments reported in this paper, the source language is German and the target language is English. Every English string is considered as a possible translation for the input. If we assign a probability $\Pr(e_1^I \mid f_1^J)$ to each pair of strings $(e_1^I, f_1^J)$, then according to Bayes' decision rule, we have to choose the English string that maximizes the product of the English language model $\Pr(e_1^I)$ and the string translation model $\Pr(f_1^J \mid e_1^I)$.</Paragraph>
<Paragraph position="1"> Many existing systems for SMT (Wang and Waibel, 1997; Nießen et al., 1998; Och and Weber, 1998) make use of a special way of structuring the string translation model (Brown et al., 1993): The correspondence between the words in the source and the target string is described by alignments that assign one target word position to each source word position. The probability of a certain English word occurring in the target string is assumed to depend basically only on the source word aligned to it. It is clear that this assumption is not always valid for the translation of natural languages. It turns out that even those approaches that relax the word-by-word assumption, like (Och et al., 1999), have problems with many phenomena typical of natural languages in general and German in particular, like
* idiomatic expressions;
* compound words that have to be translated by more than one word;
* long range dependencies like prefixes of verbs placed at the end of the sentence;
* ambiguous words with different meanings dependent on the context.</Paragraph>
<Paragraph position="2"> The parameters of the statistical knowledge sources mentioned above are trained on bilingual corpora. Bearing in mind that more than 40% of the word forms have only been seen once in training (see Tables 1 and 4), it is obvious that the phenomena listed above can hardly be learned adequately from the data and that the explicit introduction of linguistic knowledge is expected to improve translation quality.</Paragraph>
<Paragraph position="3"> The overall architecture of the statistical translation approach is depicted in Figure 1. In this figure we already anticipate the fact that we will transform the source strings in a certain manner. If necessary we can also apply the inverse of these transformations on the produced output strings. In Section 3 we explain in detail which kinds of transformations we apply.</Paragraph>
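<Paragraph position="4"> Spelled out in the notation introduced above, the decision rule described at the beginning of this section reads as follows (a reconstruction of the formula the prose describes; the explicit equation is not preserved in this extraction):
\[
\hat{e}_1^I \;=\; \operatorname*{argmax}_{e_1^I} \Pr(e_1^I \mid f_1^J)
          \;=\; \operatorname*{argmax}_{e_1^I} \bigl\{ \Pr(e_1^I) \cdot \Pr(f_1^J \mid e_1^I) \bigr\}
\]
</Paragraph>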
</Section>
<Section position="4" start_page="1081" end_page="1082" type="metho"> <SectionTitle> 3 Analysis and Transformation of the Input </SectionTitle>
<Paragraph position="0"> As already pointed out, we used the method of transforming the input string in our experiments. The advantage of this approach is that existing training and search procedures did not have to be adapted to new models incorporating the information under consideration. On the other hand, it would be more elegant to leave the decision between different readings, for instance, to the overall decision process in search. The transformation method, however, is more adequate for the preliminary identification of those phenomena relevant for improving the translation results.</Paragraph>
<Section position="1" start_page="1081" end_page="1081" type="sub_section"> <SectionTitle> 3.1 Analysis </SectionTitle>
<Paragraph position="0"> We used GERTWOL, a German morphological analyser (Haapalainen and Majorin, 1995), and the Constraint Grammar Parser for German, GERCG, for lexical analysis and morphological and syntactic disambiguation. For a description of the Constraint Grammar approach we refer the reader to (Karlsson, 1990). Some preprocessing was necessary to meet the input format requirements of the tools. In the cases where the tools returned more than one reading, either simple heuristics based on domain-specific preference rules were applied or a more general, non-ambiguous analysis was used.</Paragraph>
<Paragraph position="1"> In the following subsections we list some transformations we have tested.</Paragraph> </Section>
<Section position="2" start_page="1081" end_page="1082" type="sub_section"> <SectionTitle> 3.2 Separated German Verb Prefixes </SectionTitle>
<Paragraph position="0"> Some verbs in German consist of a main part and a detachable prefix which can be shifted to the end of the clause, e.g. "losfahren" ("to leave") in the sentence "Ich fahre morgen los.". We extracted all word forms of separable verbs from the training corpus. The resulting list contains entries of the form prefix|main. The entry "los|fahre" indicates, for example, that the prefix "los" can be detached from the word form "fahre". In all clauses containing a word matching a main part and a word matching the corresponding prefix part occurring at the end of the clause, the prefix is prepended to the beginning of the main part, as in "Ich losfahre morgen.".</Paragraph>
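<Paragraph position="1"> A minimal sketch of this reattachment step is given below. The toy prefix|main list, the function name, and the assumption that the detached prefix is simply the last token of the clause are illustrative choices, not the paper's actual implementation.
```python
# Minimal sketch of the verb-prefix reattachment described above.
# SEPARABLE stands in for the prefix|main list extracted from the training
# corpus; the entries and names here are illustrative only.

SEPARABLE = {("los", "fahre"), ("an", "kommen"), ("mit", "bringe")}  # (prefix, main) pairs

def reattach_prefix(clause_tokens):
    """If the clause ends in a detached prefix that pairs with an earlier
    main verb form, glue the prefix back onto that verb."""
    if not clause_tokens:
        return clause_tokens
    prefix = clause_tokens[-1].lower()
    for i, tok in enumerate(clause_tokens[:-1]):
        if (prefix, tok.lower()) in SEPARABLE:
            # prepend the prefix to the main part and drop the clause-final token
            return clause_tokens[:i] + [prefix + tok] + clause_tokens[i + 1:-1]
    return clause_tokens

print(reattach_prefix(["Ich", "fahre", "morgen", "los"]))
# -> ['Ich', 'losfahre', 'morgen']
```
</Paragraph>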
</Section>
<Section position="3" start_page="1081" end_page="1082" type="sub_section"> <SectionTitle> 3.3 German Compound Words </SectionTitle>
<Paragraph position="0"> German compound words pose special problems to the robustness of a translation method, because the word itself must be represented in the training data: the occurrence of each of its components is not enough. The word "Früchtetee", for example, cannot be translated although its components "Früchte" and "Tee" appear in the training set of EUTRANS. Besides, even if the compound occurs in training, the training algorithm may not be capable of translating it properly as two words (in the mentioned case the words "fruit" and "tea") due to the word alignment assumption mentioned in Section 2. We therefore split the compound words into their components.</Paragraph>
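<Paragraph position="1"> The sketch below illustrates the idea with a simple vocabulary-driven split. The authors rely on morphological analysis (GERTWOL) rather than this heuristic, and the toy vocabulary, minimum length and greedy strategy here are assumptions made purely for illustration.
```python
# Illustrative compound splitter: split a word into two parts when both
# parts occur in the training vocabulary. A sketch only; the paper uses a
# full morphological analysis, not this heuristic.

TRAIN_VOCAB = {"früchte", "tee", "haus", "tür"}  # toy training vocabulary

def split_compound(word, vocab=TRAIN_VOCAB, min_len=3):
    w = word.lower()
    if w in vocab:
        return [word]
    # try all split points, preferring the longest first component
    for i in range(len(w) - min_len, min_len - 1, -1):
        first, second = w[:i], w[i:]
        if first in vocab and second in vocab:
            return [first, second]
    return [word]  # leave unknown compounds untouched

print(split_compound("Früchtetee"))   # -> ['früchte', 'tee']
print(split_compound("Haustür"))      # -> ['haus', 'tür']
```
</Paragraph>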
</Section>
<Section position="4" start_page="1082" end_page="1082" type="sub_section"> <SectionTitle> 3.4 Annotation with POS Tags </SectionTitle>
<Paragraph position="0"> One way of helping the disambiguation of ambiguous words is to annotate them with their part of speech (POS) information. We chose the following very frequent short words that often caused errors in translation for VERBMOBIL:
"aber" can be adverb or conjunction.
"zu" can be adverb, preposition, separated verb prefix or infinitive marker.
"der", "die" and "das" can be definite articles or pronouns.</Paragraph>
<Paragraph position="1"> The difficulties due to these ambiguities are illustrated by the following examples: The sentence "Das würde mir sehr gut passen." is often translated by "The would suit me very well." instead of "That would suit me very well.", and "Das war zu schnell." is translated by "That was to fast." instead of "That was too fast.". We appended the POS tag in training and test corpus for the VERBMOBIL task (see 4.1).</Paragraph> </Section>
<Section position="5" start_page="1082" end_page="1082" type="sub_section"> <SectionTitle> 3.5 Merging Phrases </SectionTitle>
<Paragraph position="0"> The expression "irgend etwas", for example, functions as an indefinite pronoun. Like 21 other multi-word phrases, "irgend-etwas" is merged in order to form one single vocabulary entry.</Paragraph> </Section>
<Section position="6" start_page="1082" end_page="1082" type="sub_section"> <SectionTitle> 3.6 Treatment of Unseen Words </SectionTitle>
<Paragraph position="0"> For statistical machine translation it is difficult to handle words not seen in training. For unknown proper names, it is normally correct to place the word unchanged into the translation.</Paragraph>
<Paragraph position="1"> We have been working on the treatment of unknown words of other types. As already mentioned in Section 3.3, the splitting of compound words can reduce the number of unknown German words.</Paragraph>
<Paragraph position="2"> In addition, we have examined methods of replacing a word fullform by a more abstract word form and checking whether this form is known and can be translated. The translation of the simplified word form is generally not the precise translation of the original one, but sometimes the intended semantics is conveyed, e.g.:
"kaltes" is an adjective in the singular neuter form and can be transformed to the less specific form "kalt" ("cold").
"Jahre" ("years") can be replaced by the singular form "Jahr".
"beneidest" ("to envy" in the second person singular): if the infinitive form "beneiden" is not known, it might help just to remove the leading particle "be".</Paragraph>
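<Paragraph position="3"> One possible shape of this back-off is sketched below under the assumption of a few hand-written reduction rules. The paper derives the more general forms from morphological analysis; the rule set, the known-word list and all names in this sketch are illustrative assumptions.
```python
# Sketch of the fallback for unseen words described above: if a fullform is
# unknown, try progressively more general forms and keep the first one that
# is known. The reduction rules are toy stand-ins for morphological analysis.

KNOWN = {"kalt", "jahr", "neiden"}           # toy set of known (translatable) forms

def simplify_candidates(word):
    w = word.lower()
    cands = []
    if w.endswith("es"):                      # e.g. "kaltes" -> "kalt"
        cands.append(w[:-2])
    if w.endswith("e"):                       # e.g. "Jahre" -> "Jahr"
        cands.append(w[:-1])
    if w.startswith("be"):                    # e.g. "beneiden" -> "neiden"
        cands.append(w[2:])
    return cands

def backoff_form(word):
    if word.lower() in KNOWN:
        return word
    for cand in simplify_candidates(word):
        if cand in KNOWN:
            return cand
    return word                               # give up: keep the unknown word

for w in ["kaltes", "Jahre", "beneiden"]:
    print(w, "->", backoff_form(w))
# kaltes -> kalt, Jahre -> jahr, beneiden -> neiden
```
</Paragraph>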
</Section> </Section>
<Section position="5" start_page="1082" end_page="1084" type="metho"> <SectionTitle> 4 Translation Results </SectionTitle>
<Paragraph position="0"> We use the SSER (subjective sentence error rate) (Nießen et al., 2000) as evaluation criterion: Each translated sentence is judged by a human examiner according to an error scale from 0.0 (semantically and syntactically correct) to 1.0 (completely wrong).</Paragraph>
<Section position="1" start_page="1082" end_page="1083" type="sub_section"> <SectionTitle> 4.1 Translation Results for VERBMOBIL </SectionTitle>
<Paragraph position="0"> The VERBMOBIL corpus consists of spontaneously spoken dialogs in the appointment scheduling domain (Wahlster, 1993). German sentences are translated into English. The output of the speech recognizer (for example the single-best hypothesis) is used as input to the translation modules. For research purposes the original text spoken by the users can be presented to the translation system to evaluate the MT component separately from the recognizer. The training set consists of 45 680 sentence pairs.</Paragraph>
<Paragraph position="1"> Testing was carried out on a separate set of 147 sentences that do not contain any unseen words. In Table 1 the characteristics of the training sets are summarized for the original corpus and after the application of the described transformations on the German part of the corpus. The table shows that on this corpus the splitting of compounds improves the token-type ratio from 59.7 to 65.2, but the number of singletons (words seen only once in training) does not go down by more than 2.8%. The other transformations (prepending separated verb prefixes, "pref"; annotation with POS tags, "pos"; merging of phrases, "merge") do not affect these corpus statistics much.</Paragraph>
<Paragraph position="2"> The translation performance results are given in Table 2 for translation of text and in Table 3 for translation of the single-best hypothesis given by a speech recognizer (accuracy 69%). For both cases, translation of text and of speech input, splitting compound words does not improve translation quality, but it is not harmful either. The treatment of separable prefixes helps, as does annotating some words with part of speech information. Merging of phrases does not improve the quality much further. The best translations were achieved with the combination of POS annotation, phrase merging and prepending separated verb prefixes. This holds for both translation of text and of speech input. The fact that these hard-coded transformations are not only helpful on text input, but also on speech input is quite encouraging. As an example makes clear, this cannot be taken for granted: The test sentence "Dann fahren wir dann los." is recognized as "Dann fahren wir dann uns.", and the fact that separable verbs then do not occur in their separated form in the training data is unfavorable in this case. The figures show that in general the speech recognizer output contains enough information for helpful preprocessing.</Paragraph> </Section>
<Section position="2" start_page="1083" end_page="1084" type="sub_section"> <SectionTitle> 4.2 Translation Results for EUTRANS </SectionTitle>
<Paragraph position="0"> The EUTRANS corpus consists of different types of German-English texts belonging to the tourism domain: web pages of hotels, touristic brochures and business correspondence. The string translation and language model parameters were trained on 27 028 sentence pairs. The 200 test sentences contain 150 words never seen in training.</Paragraph>
<Paragraph position="1"> Table 4 summarizes the corpus statistics of the training set for the original corpus, after splitting of compound words and after additional prepending of separated verb prefixes ("split+prefixes"). The splitting of compounds improves the token-type ratio from 8.6 to 12.3, and the number of words seen only once in training is reduced by 8.9%.</Paragraph>
<Paragraph position="2"> The number of words never seen in training reduces from 150 to 81 by compound splitting and can further be reduced to 69 by replacing the unknown word forms by more general forms. 80 unknown words are encountered when verb prefixes are treated in addition to compound splitting.</Paragraph>
<Paragraph position="3"> Experiments for POS annotation have not been performed on this corpus because no small set of ambiguous words causing many of the translation errors on this task can be identified: Compared to the VERBMOBIL task, this corpus is less homogeneous. Merging of phrases did not help much on VERBMOBIL and is therefore not tested here.</Paragraph>
<Paragraph position="4"> Table 5 shows that the splitting of compound words yields an improvement in the subjective sentence error rate of 4.5%, and the treatment of unknown words ("unk") improves the translation quality by an additional 1%. Treating separable verb prefixes in addition to splitting compounds gives the best result so far, with an improvement of 7.1% absolute compared to the baseline.</Paragraph> </Section> </Section> </Paper>