<?xml version="1.0" standalone="yes"?> <Paper uid="W01-1407"> <Title>Toward hierarchical models for statistical machine translation of inflected languages</Title>

<Section position="3" start_page="0" end_page="0" type="metho">
<SectionTitle> 2 Statistical Machine Translation </SectionTitle>
<Paragraph position="0"> The goal of the translation process in statistical machine translation can be formulated as follows: A source language string $f_1^J = f_1 \ldots f_J$ is to be translated into a target language string $e_1^I = e_1 \ldots e_I$. In this paper, the source language is German and the target language is English. Every English string is considered as a possible translation for the input. </Paragraph>
<Paragraph position="1"> If we assign a probability $\Pr(e_1^I \mid f_1^J)$ to each pair of strings $(e_1^I, f_1^J)$, then according to Bayes' decision rule we have to choose the English string that maximizes the product of the English language model $\Pr(e_1^I)$ and the string translation model $\Pr(f_1^J \mid e_1^I)$:
$$\hat{e}_1^I = \operatorname*{argmax}_{e_1^I} \left\{ \Pr(e_1^I) \cdot \Pr(f_1^J \mid e_1^I) \right\}$$ </Paragraph>
<Paragraph position="2"> Many existing systems for statistical machine translation (Wang and Waibel, 1997; Niessen et al., 1998; Och and Weber, 1998) structure the string translation model in the way proposed by Brown et al. (1993): the correspondence between the words in the source and the target string is described by alignments that assign one target word position to each source word position. The lexicon probability of a certain English word $e$ occurring in the target string is assumed to depend basically only on the source word $f$ aligned to it. </Paragraph>
<Paragraph position="3"> The overall architecture of the statistical translation approach is depicted in Figure 1. In this figure we already anticipate the fact that the source strings can be transformed in a certain manner. </Paragraph>
</Section>
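As an illustration of the decision rule above, the following minimal Python sketch carries out the noisy-channel argmax. The probability tables, the candidate list, and the example sentence are invented stand-ins, not the paper's trained models; a real system would plug in an n-gram language model and an IBM-style translation model.

```python
import math

# Toy stand-ins for Pr(e) and Pr(f | e); all values are invented.
LM = {("we", "want", "to", "leave"): 0.6, ("we", "will", "leave"): 0.4}
TM = {
    (("wir", "wollen", "aufbrechen"), ("we", "want", "to", "leave")): 0.5,
    (("wir", "wollen", "aufbrechen"), ("we", "will", "leave")): 0.3,
}

def lm_prob(e):
    """Toy English language model Pr(e)."""
    return LM.get(e, 1e-9)

def tm_prob(f, e):
    """Toy string translation model Pr(f | e)."""
    return TM.get((f, e), 1e-9)

def translate(f, candidates):
    # Bayes' decision rule: argmax_e Pr(e) * Pr(f | e), in log space
    # for numerical stability.
    return max(candidates, key=lambda e: math.log(lm_prob(e)) + math.log(tm_prob(f, e)))

print(translate(("wir", "wollen", "aufbrechen"), list(LM)))
# -> ('we', 'want', 'to', 'leave')
```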
<Section position="4" start_page="0" end_page="0" type="metho">
<SectionTitle> 3 Basic Considerations </SectionTitle>
<Paragraph position="0"> The parameters of the statistical knowledge sources mentioned above are trained on bilingual corpora. </Paragraph>
[Figure 1: Architecture of the translation approach based on Bayes' decision rule.]
<Paragraph position="1"> In general, the resulting probabilistic lexica contain all word forms occurring in the training corpora as separate entries, without taking into account whether or not they are derivatives of the same lemma. Bearing in mind that 40% of the word forms have been seen only once in training (see Table 2), it is obvious that learning the correct translations is difficult for many words. Besides, new input sentences can be expected to contain unknown word forms, for which no translation can be retrieved from the lexica. As Table 2 shows, this problem is especially relevant for highly inflected languages like German: German texts contain many more different word forms than their English translations. The table also reveals that these word forms are often derived from a much smaller set of base forms ("lemmata"); when we compare the number of different lemmata, and the number of lemmata occurring only once in the training data, the German and English texts resemble each other much more closely. </Paragraph>
<Paragraph position="2"> Another relevant fact is that conventional dictionaries are often available in electronic form for the language pair under consideration. Their usability for statistical machine translation is restricted because they differ substantially from full bilingual parallel corpora: the entries are typically pairs of base forms that are translations of each other, whereas the corpora contain full sentences with inflected forms. Making the information taken from external dictionaries more useful for the translation of inflected languages is therefore an interesting objective. </Paragraph>
<Paragraph position="3"> As a consequence of these considerations, we aim at taking into account the interdependencies between the different derivatives of the same base form. </Paragraph>
</Section>

<Section position="5" start_page="0" end_page="0" type="metho">
<SectionTitle> 4 Output Representation after Morpho-syntactic Analysis </SectionTitle>
<Paragraph position="0"> We use GERCG, a Constraint Grammar parser for German, for lexical analysis and morphological and syntactic disambiguation. For a description of the Constraint Grammar approach we refer the reader to (Karlsson, 1990). Figure 2 gives an example of the information provided by this tool. </Paragraph>
<Paragraph position="1"> Figure 2: Sample analysis of a German sentence.
Input: Wir wollen nach dem Essen nach Essen aufbrechen
"<*wir>"        "wir"        * PRON PERS PL1 NOM
"<wollen>"      "wollen"     V IND PRÄS PL1
"<nach>"        "nach"       pre PRÄP Dat
"<dem>"         "das"        ART DEF SG DAT NEUTR
"<*essen>"      "*essen"     S NEUTR SG DAT
"<nach>"        "nach"       pre PRÄP Dat
"<*essen>"      "*essen"     S EIGEN NEUTR SG DAT
                "*esse"      S FEM PL DAT
                "*essen"     S NEUTR PL DAT
                "*essen"     S NEUTR SG DAT
"<aufbrechen>"  "aufbrechen" V INF </Paragraph>
<Paragraph position="2"> A full word form is represented by the information provided by the morpho-syntactic analysis: from the interpretation "gehen-V-IND-PRÄS-SG1", i.e. the lemma plus part of speech plus the other tags, the word form "gehe" can be restored. From Figure 2 we see that the tool can quite reliably disambiguate between different readings: it infers, for instance, that the word "wollen" is a verb in the indicative present first person plural form. Without any context taken into account, "wollen" has other readings; it can even be interpreted as derived not from a verb but from an adjective meaning "made of wool". In this sense, the information inherent in the original word forms is augmented by the disambiguating analyzer. This can be useful for deriving the correct translation of ambiguous words. </Paragraph>
<Paragraph position="3"> In the rare cases where the tool returned more than one reading, it is often possible to apply simple heuristics based on domain-specific preference rules, or to fall back on a more general, non-ambiguous analysis. </Paragraph>
<Paragraph position="4"> The new representation of the corpus, in which full word forms are replaced by lemmata plus morphological and syntactic tags, makes it possible to gradually reduce the information: for example, we can consider certain instances of words as equivalent. We have used this fact to better exploit the bilingual training data along two directions: omitting unimportant information and using hierarchical translation models. </Paragraph>
</Section>
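To make this representation concrete, here is a minimal Python sketch of one plausible encoding of the lemma-plus-tags output, with coarser variants derived by dropping tags. The Analysis class and the hard-coded readings are our own illustration; the paper obtains the analyses from GERCG.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Analysis:
    """One reading of a word form: lemma plus morpho-syntactic tags."""
    lemma: str
    tags: tuple[str, ...]

    def reduce(self, drop: frozenset[str]) -> "Analysis":
        # Coarser representation: the selected tags are ignored, so word
        # forms differing only in those tags fall into the same class.
        return Analysis(self.lemma, tuple(t for t in self.tags if t not in drop))

# "gehe" is represented as gehen-V-IND-PRÄS-SG1 (example from the text);
# "gehst" (SG2) is an assumed second form for illustration.
gehe = Analysis("gehen", ("V", "IND", "PRÄS", "SG1"))
gehst = Analysis("gehen", ("V", "IND", "PRÄS", "SG2"))

drop = frozenset({"SG1", "SG2"})            # ignore person/number
assert gehe.reduce(drop) == gehst.reduce(drop)  # now equivalent
print(gehe.reduce(drop))
```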
<Section position="6" start_page="0" end_page="0" type="metho">
<SectionTitle> 5 Equivalence Classes of Words with Similar Translations </SectionTitle>
<Paragraph position="0"> Inflected forms of words in the input language contain information that is not relevant for translation. This is especially true when translating from a highly inflected language like German into English: in bilingual German-English corpora, the German part contains many more different word forms than the English part (see Table 2). It is useful for statistical machine translation to define equivalence classes of word forms that tend to be translated by the same target language word: the resulting statistical translation lexica become smoother and the coverage is significantly improved. We construct these equivalence classes by omitting that information from the morpho-syntactic analysis which is not relevant for the translation task. </Paragraph>
<Paragraph position="1"> The representation of the corpus as provided by the analyzing tools helps to identify, and access, the unimportant information. What counts as relevant and unimportant information, respectively, depends on many factors, such as the languages involved, the translation direction, and the choice of models. </Paragraph>
<Paragraph position="2"> Linguistic knowledge can indicate which characteristics of an input sentence are crucial to the translation task and which can be ignored, but it is desirable to automate this decision process. We found that the impact of different choices of features to be ignored on the end result was not large enough to serve as a reliable criterion. Alternatively, one could define a likelihood criterion on a held-out corpus for this purpose. Another possibility is to assess the impact on the alignment quality after training, which can be evaluated automatically (Langlais et al., 1998; Och and Ney, 2000); however, since we found the alignment quality on the Verbmobil data to be consistently very high and extremely robust against manipulation of the training data, we abandoned this approach. </Paragraph>
<Paragraph position="3"> We resorted to detecting candidates from the probabilistic lexica trained for translation from German to English. For this, we focused on those derivatives of the same base form which resulted in the same translation. For each set of tags, we counted how often an additional tag could be replaced by a certain other tag without affecting the translation; a sketch of this counting procedure follows this section. Table 1 gives some of the most frequently identified candidates to be ignored while translating: the gender of nouns is irrelevant for their translation (which is not surprising, because the gender is unambiguous for a given noun), as is the case, i.e. nominative, dative, accusative (for genitive forms, the English translation differs). For verbs we found the candidates number and person: the translation of the first person singular form of a verb is often the same as the translation of the third person plural form, for example. </Paragraph>
<Paragraph position="4"> Table 1: Candidate tags to be ignored for translation.
noun:       gender, case (except genitive)
adjective:  gender, case and number
verb:       number, person </Paragraph>
<Paragraph position="5"> As a consequence, we dropped those tags which were most often identified as irrelevant for translation from German to English. </Paragraph>
</Section>
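The counting procedure described in Section 5 can be sketched in a few lines of Python. The toy lexicon below, mapping analyzed German forms to their most probable English translation, is an invented stand-in for the trained probabilistic lexica; the pairing logic mirrors the idea of replacing one tag by another and checking whether the translation is unaffected.

```python
from collections import Counter
from itertools import combinations

# Toy stand-in for a trained lexicon: (lemma, tags) -> best English translation.
lexicon = {
    ("Termin", ("S", "MASC", "SG", "NOM")): "appointment",
    ("Termin", ("S", "MASC", "SG", "DAT")): "appointment",
    ("Termin", ("S", "MASC", "SG", "AKK")): "appointment",
    ("Termin", ("S", "MASC", "SG", "GEN")): "appointment's",
}

interchangeable = Counter()
for ((lemma1, tags1), e1), ((lemma2, tags2), e2) in combinations(lexicon.items(), 2):
    if lemma1 != lemma2:
        continue
    diff = set(tags1) ^ set(tags2)       # tags in which the two entries differ
    if len(diff) == 2 and e1 == e2:      # exactly one tag replaced, same translation
        interchangeable[frozenset(diff)] += 1

# Tag pairs that never change the translation are candidates to be ignored.
print(interchangeable.most_common())
# -> NOM/DAT, NOM/AKK, DAT/AKK are interchangeable; pairs involving GEN are not.
```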
<Section position="7" start_page="0" end_page="0" type="metho">
<SectionTitle> 6 Hierarchical Models </SectionTitle>
<Paragraph position="0"> One way of taking into account the interdependencies of different derivatives of the same base form is to introduce equivalence classes $F_k$ at various levels of abstraction, starting with the inflected form and ending with the lemma. </Paragraph>
<Paragraph position="1"> Consider, for example, the German verb form $f$ = "ankomme", which is derived from the lemma "ankommen" and which can be translated into English by $e$ = "arrive". The hierarchy of equivalence classes is as follows:
$$F_{n_{\max}}(f) \subseteq F_{n_{\max}-1}(f) \subseteq \ldots \subseteq F_0(f),$$
where $n_{\max}$ is the maximal number of morpho-syntactic tags. $F_{n_{\max}-1}$ contains the forms "ankomme", "ankommst" and "ankommt"; in $F_{n_{\max}-2}$ the number (SG or PL) is ignored as well, and so on. The largest equivalence class $F_0$ contains all derivatives of the infinitive "ankommen". </Paragraph>
<Paragraph position="2"> We can now define the lexicon probability of a word $f$ to be translated by $e$ with respect to level $k$ of the hierarchy:
$$p_k(f \mid e) = \sum_{\hat{f}_k} p(f \mid \hat{f}_k) \, p(\hat{f}_k \mid e), \qquad (1)$$
where $\hat{f}_k$ denotes the representation of the word in which the lemma $\hat{f}_0$ and $k$ additional tags are taken into account; the probability functions are defined to return zero for impossible interpretations of $f$. For the example above, $\hat{f}_0$ = "ankommen". The level-specific lexica are combined by linear interpolation:
$$p(f \mid e) = \sum_{k=0}^{n_{\max}} \lambda_k \, p_k(f \mid e). \qquad (2)$$ </Paragraph>
</Section>
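The following minimal Python sketch instantiates the two-level case of Equations (1) and (2) that is used in the experiments of Section 7.3 ($n_{\max} = 1$, $\lambda_0 = \lambda_1 = 0.5$), assuming unique lemmatization so that the sum in Equation (1) collapses to a single term. The lexica, the lemmatizer dictionary, and all probability values are invented stand-ins for the trained Model 4 lexica.

```python
# Two-level hierarchical lexicon p(f | e), sketching Equations (1)-(2)
# with n_max = 1 and lambda_0 = lambda_1 = 0.5. Toy data throughout.

full_form = {("ankomme", "arrive"): 0.9}   # p_1(f | e): full-form lexicon
lemma_lex = {("ankommen", "arrive"): 0.8}  # p(lemma | e): lemma-level lexicon
lemma_of = {"ankomme": "ankommen", "ankommst": "ankommen", "ankommt": "ankommen"}
derivatives = {"ankommen": ["ankommen", "ankomme", "ankommst", "ankommt"]}

def p0(f: str, e: str) -> float:
    """Lemma level: p(f | lemma) is uniform over the lemma's derivatives
    (plus the base form itself), as described in Section 7.3."""
    lemma = lemma_of.get(f, f)
    forms = derivatives.get(lemma, [lemma])
    return (lemma_lex.get((lemma, e), 0.0) / len(forms)) if f in forms else 0.0

def p_hier(f: str, e: str, lam=(0.5, 0.5)) -> float:
    # Equation (2): linear interpolation of the two levels.
    return lam[0] * p0(f, e) + lam[1] * full_form.get((f, e), 0.0)

# "ankommt" never occurred as a full form in training, but its lemma
# class still licenses the translation "arrive":
print(p_hier("ankommt", "arrive"))  # 0.5 * (0.8 / 4) + 0.5 * 0.0 = 0.1
```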
<Section position="8" start_page="0" end_page="0" type="metho">
<SectionTitle> 7 Translation Experiments </SectionTitle>
<Paragraph position="0"> Experiments were carried out on Verbmobil data, which consist of spontaneously spoken dialogs in the appointment scheduling domain (Wahlster, 1993). German source sentences are translated into English. </Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 7.1 Treatment of Ambiguity </SectionTitle>
<Paragraph position="0"> Common bilingual corpora normally contain full sentences, which provide enough context information to rule out all but one reading for an inflected word form. To reduce the remaining uncertainty, we have implemented preference rules. For instance, we assume that the corpus has been correctly true-case converted beforehand, and as a consequence we drop non-noun interpretations of uppercase words. Besides, we prefer indicative verb readings over subjunctive or imperative ones. For the remaining ambiguities, we resort to the unambiguous parts of the readings, i.e. we drop all tags causing mixed interpretations. </Paragraph>
<Paragraph position="1"> Some special problems arise in the analysis of external lexica, whose entries do not provide enough context to enable efficient disambiguation. We are currently implementing methods for handling this special situation. </Paragraph>
<Paragraph position="2"> It can be argued that it would be more elegant to leave the decision between different readings to the overall decision process in search. We plan this integration for the future. </Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 7.2 Performance Measures </SectionTitle>
<Paragraph position="0"> We use the following evaluation criteria (Niessen et al., 2000):
- SSER (subjective sentence error rate): Each translated sentence is judged by a human examiner according to an error scale from 0.0 (semantically and syntactically correct) to 1.0 (completely wrong).
- ISER (information item semantic error rate): The test sentences are segmented into information items; for each of them, the translation candidates are assigned either "ok" or an error class. If the intended information is conveyed, the error count is not increased, even in the presence of slight syntactical errors that do not seriously impair intelligibility. </Paragraph>
</Section>
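As a rough illustration of how these two criteria aggregate over a test set, here is a minimal Python sketch. The paper only defines the judgment scales; the aggregation below (averaging the per-sentence scores for SSER, taking the fraction of non-"ok" items for ISER) is our assumption for illustration, as are the example judgments.

```python
def sser(sentence_scores: list[float]) -> float:
    """Subjective sentence error rate: average of the per-sentence error
    scores, each a human judgment between 0.0 (correct) and 1.0
    (completely wrong). Returned in percent."""
    return 100.0 * sum(sentence_scores) / len(sentence_scores)

def iser(item_judgments: list[str]) -> float:
    """Information item semantic error rate: fraction of information items
    whose translation was not judged "ok". Returned in percent."""
    errors = sum(1 for j in item_judgments if j != "ok")
    return 100.0 * errors / len(item_judgments)

print(sser([0.0, 0.25, 1.0]))            # ~41.7 [%]
print(iser(["ok", "ok", "word_order"]))  # ~33.3 [%]
```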
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 7.3 Translation Results </SectionTitle>
<Paragraph position="0"> The training set consists of 58 322 sentence pairs. Table 2 summarizes the characteristics of the training corpus used for training the parameters of Model 4 proposed in (Brown et al., 1993). Testing was carried out on 200 sentences not contained in the training data; detailed statistics are given in Table 3. </Paragraph>
<Paragraph position="1"> Table 2: Statistics of the training corpus. Singletons are types occurring only once in training.
                           English    German
no. of running words       550 213    519 790
no. of word forms            4 670      7 940
no. of singletons            1 696      3 452
singletons [%]                  36         43
no. of lemmata               3 875      3 476
no. of singleton lemmata     1 322      1 457 </Paragraph>
<Paragraph position="2"> Table 3: Statistics of the test corpus for German-to-English translation. Unknowns are word forms not contained in the training corpus.
no. of sentences                  200
no. of running words            2 055
no. of word forms                 385
no. of unknown word forms          25 </Paragraph>
<Paragraph position="3"> We used the translation system called "single-word based approach" described in (Tillmann and Ney, 2000) and compared to other approaches in (Ney et al., 2000). </Paragraph>
<Paragraph position="4"> So far we have performed experiments with hierarchical lexica in which two levels are combined, i.e. $n_{\max}$ in Equation (2) is set to 1. $\lambda_0$ and $\lambda_1$ are set to 0.5, and $p(f \mid \hat{f}_0)$ is modeled as a uniform distribution over all derivatives of the lemma $\hat{f}_0$ occurring in the training data, plus the base form itself in case it is not contained. The process of lemmatization is unique in the majority of cases, and as a consequence the sum in Equation (1) is not needed for a two-level lexicon combining full word forms and lemmata. </Paragraph>
<Paragraph position="5"> As the results summarized in Table 4 show, the combined lexicon outperforms the conventional one-level lexicon. As expected, the quality gain achieved by smoothing the lexicon is larger if the training procedure can take advantage of an additional conventional dictionary for learning translation pairs, because such dictionaries typically contain only base forms of words, whereas translations of fully inflected forms are needed in the test situation. </Paragraph>
<Paragraph position="6"> Examples taken from the test set are given in Figure 3. Smoothing the lexicon entries over the derivatives of the same lemma enables the translation of "sind" by "would" instead of "are". The smoothed lexicon contains the translation "convenient" for any derivative of "bequem"; the comparative "more convenient" would be the completely correct translation. </Paragraph>
<Paragraph position="7"> As already mentioned, we resorted to choosing one single reading for each word by applying heuristics (see Section 7.1). For the normal training corpora, unlike additional external dictionaries, this is not critical, because they predominantly contain full sentences, which provide enough context for efficient disambiguation. We are currently working on the problem of analyzing the entries in conventional dictionaries, but for the time being, the experiments with equivalence classes have been carried out using only bilingual corpora for estimating the model parameters. </Paragraph>
<Paragraph position="8"> Table 5 shows the effect of introducing equivalence classes. The information from the morpho-syntactic analyzer (stems plus tags, as described in Section 4) is reduced by dropping unimportant information as described in Section 5. Both error rates decreased in comparison to using the original corpus with inflected word forms; the reduction of 3.3% absolute in the information item semantic error rate shows that more of the intended meaning could be found in the produced translations. </Paragraph>
<Paragraph position="9"> Table 5: Effect of equivalence classes. For the baseline we used the original inflected word forms.
                       SSER [%]   ISER [%]
inflected words          37.4       26.8
equivalence classes      35.9       23.5 </Paragraph>
<Paragraph position="10"> The first two examples in Figure 4 demonstrate the effect of the disambiguating analyzer, which identifies "Hotelzimmer" as singular on the basis of the context (the word itself can represent the plural form as well), and "das" as an article as opposed to a pronoun. The third example shows the advantage of grouping words into equivalence classes: the training data does not contain the word "billigeres", but when generalizing over the gender and case information, a correct translation can be produced. </Paragraph>
</Section>
</Section>
</Paper>