<?xml version="1.0" standalone="yes"?>
<Paper uid="J04-2003">
  <Title>Statistical Machine Translation with Scarce Resources Using Morpho-syntactic Information</Title>
  <Section position="4" start_page="183" end_page="186" type="metho">
    <SectionTitle>
2. Morpho-syntactic Information
</SectionTitle>
    <Paragraph position="0"> A prerequisite for the methods for improving the quality of statistical machine translation described in this article is the availability of various kinds of morphological and syntactic information. This section describes the output resulting from morpho-syntactic analysis and explains which parts of the analysis are used and how the output is represented for further processing.</Paragraph>
    <Section position="1" start_page="183" end_page="183" type="sub_section">
      <SectionTitle>
2.1 Description of the Analysis Results
</SectionTitle>
      <Paragraph position="0"> For obtaining the required morpho-syntactic information, the following analyzers for German and English were applied: gertwol and engtwol for lexical analysis and gercg and engcg for morphological and syntactic disambiguation. For a description of the underlying approach, the reader is referred to Karlsson (1990). Tables 1 and 2 give examples of the information provided by these tools.</Paragraph>
    </Section>
    <Section position="2" start_page="183" end_page="185" type="sub_section">
      <SectionTitle>
2.2 Treatment of Ambiguity
</SectionTitle>
      <Paragraph position="0"> The examples in Tables 1 and 2 demonstrate the capability of the tools to disambiguate among different readings: For instance, they infer that the word wollen is a verb in the indicative present first-person plural form. Without any context taken into account,  Computational Linguistics Volume 30, Number 2 Table 1 Sample analysis of a German sentence. Input: Wir wollen nach dem Abendessen nach Essen aufbrechen. (In English: We want to start for Essen after dinner.) Original Base form Tags Wir wir personal-pronoun plural first nominative wollen wollen verb indicative present plural first nach nach preposition dative dem das definite-article singular dative neuter Abendessen Abend#essen noun neuter singular dative nach nach preposition dative Essen Essen noun name neuter singular dative Esse noun feminine plural dative Essen noun neuter plural dative Essen noun neuter singular dative aufbrechen auf|brechen verb separable infinitive Table 2 Sample analysis of an English sentence. Input: Do we have to reserve rooms?.</Paragraph>
      <Paragraph position="1"> Original Base form Tags Do do verb present not-singular-third finite auxiliary we we personal-pronoun nominative plural first subject have have verb infinitive not-finite main to to infinitive-marker reserve reserve verb infinitive not-finite main rooms room noun nominative plural object wollen has other readings. It can even be interpreted as derived from an adjective with the meaning &amp;quot;made of wool.&amp;quot; The inflected word forms on the German part of the Verbmobil (cf. Section 7.1.1) corpus have on average 2.85 readings (1.86 for the English corpus), 58% of which can be eliminated by the syntactic analyzers on the basis of sentence context.</Paragraph>
      <Paragraph position="2"> Common bilingual corpora normally contain full sentences, which provide enough context information for ruling out all but one reading for an inflected word form. To reduce the remaining uncertainty, preference rules have been implemented. For instance, it is assumed that the corpus is correctly true-case-converted beforehand, and as a consequence, non-noun readings of uppercase words are dropped. Furthermore, indicative verb readings are preferred to subjunctive or imperative. In addition, some simple domain-specific heuristics are applied. The reading &amp;quot;plural of Esse&amp;quot; for the German word form Essen, for instance, is much less likely in the domain of appointment scheduling and travel arrangements than the readings &amp;quot;proper name of the town Essen&amp;quot; or the German equivalent of the English word meal. As can be seen in Table 3, the reduction in the number of readings resulting from these preference rules is fairly small in the case of the Verbmobil corpus.</Paragraph>
      <Paragraph position="3"> The remaining ambiguity often lies in those parts of the information which are not used or which are not relevant to the translation task. For example, the analyzers cannot tell accusative from dative case in German, but the case information is not essential for the translation task (see also Table 4). Section 2.4 describes a method  By resorting to unambiguous part 1.00 1.00 for selecting morpho-syntactic tags considered relevant for the translation task, which results in a further reduction in the number of readings per word form to 1.06 for German and 1.01 for English. In these rare cases of ambiguity it is admissible to resort to the unambiguous parts of the readings, that is, to drop all tags causing mixed interpretations. Table 3 summarizes the gradual resolution of ambiguity. The analysis of conventional dictionaries poses some special problems, because they do not provide enough context to enable effective disambiguation. For handling this special situation, dedicated methods have been implemented; these are presented in Section 5.1.</Paragraph>
    </Section>
    <Section position="3" start_page="185" end_page="185" type="sub_section">
      <SectionTitle>
2.3 The Lemma-Tag Representation
</SectionTitle>
      <Paragraph position="0"> A full word form is represented by the information provided by the morpho-syntactic analysis: from the interpretation gehenverbindicativepresentfirstsingular, that is, the base form plus part of speech plus the other tags, the word form gehe can be restored. It has already been mentioned that the analyzers can disambiguate among different readings on the basis of context information. In this sense, the information inherent in the original word forms is augmented by the disambiguating analyzer.</Paragraph>
      <Paragraph position="1"> This can be useful for choosing the correct translation of ambiguous words. Of course, these disambiguation clues result in an enlarged vocabulary. The vocabulary of the new representation of the German part of the Verbmobil corpus, for example, in which full word forms are replaced by base form plus morphological and syntactic tags (lemmatag representation), is one and a half times as large as the vocabulary of the original corpus. On the other hand, the information in the lemma-tag representation can be accessed gradually and ultimately reduced: For example, certain instances of words can be considered equivalent. This fact is used to better exploit the bilingual training data along two directions: detecting and omitting unimportant information (see Section 2.4) and constructing hierarchical translation models (see Section 4). To summarize, the lemma-tag representation of a corpus has the following main advantages: It makes context information locally available, and it allows information to be explicitly accessed at different levels of abstraction.</Paragraph>
    </Section>
    <Section position="4" start_page="185" end_page="186" type="sub_section">
      <SectionTitle>
2.4 Equivalence Classes of Words with Similar Translation
</SectionTitle>
      <Paragraph position="0"> Inflected word forms in the input language often contain information that is not relevant for translation. This is especially true for the task of translating from a more inflecting language like German into English, for instance: In parallel German/English corpora, the German part contains many more distinct word forms than the English part (see, for example, Table 5). It is useful for the process of statistical machine translation to define equivalence classes of word forms which tend to be translated by the same target language word: The resulting statistical translation lexicon becomes  and case (nominative, dative, accusative) Verb Number (singular, plural) and person (first, second, third) Adjective Gender, case, and number Number Case smoother, and the coverage is considerably improved. Such equivalence classes are constructed by omitting those items of information from morpho-syntactic analysis which are not relevant for translation.</Paragraph>
      <Paragraph position="1"> The lemma-tag representation of the corpus helps to identify the unimportant information. The definition of relevant and unimportant information, respectively, depends on many factors like the languages involved, the translation direction, and the choice of the models. We detect candidates for equivalence classes of words automatically from the probabilistic lexicon trained for translation from German to English. For this purpose, those inflected forms of the same base form which result in the same translation are inspected. For each set of tags T, the algorithm counts how often an additional tag t  can be replaced with a certain other tag t  without effect on the translation. As an example, let T = 'blau-adjective', t  ='feminine'.</Paragraph>
      <Paragraph position="2"> The two entries ('blau-adjective-masculine'|'blue') and ('blau-adjective-feminine'|'blue') are hints for detecting gender as nonrelevant when translating adjectives into English. Table 4 lists some of the most frequently identified candidates to be ignored while translating: The gender of nouns is irrelevant for their translation (which is straightforward, as the gender of a noun is unambiguous), as are the cases nominative, dative, accusative. (For the genitive forms, the translation in English differs.) For verbs the candidates number and person were found: The translation of the first-person singular form of a verb, for example, is often the same as the translation of the third-person plural form. Ignoring (dropping) those tags most often identified as irrelevant for translation results in the building of equivalence classes of words. Doing so results in a smaller vocabulary, one about 65.5% the size of the vocabulary of the full lemma-tag representation of the Verbmobil corpus, for example--it is even smaller than the vocabulary of the original full-form corpus.</Paragraph>
      <Paragraph position="3"> The information described in this section is used to improve the quality of statistical machine translation and to better exploit the available bilingual resources.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="186" end_page="187" type="metho">
    <SectionTitle>
3. Treatment of Structural Differences
</SectionTitle>
    <Paragraph position="0"> Difference in sentence structure is one of the main sources of errors in machine translation. It is thus promising to &amp;quot;harmonize&amp;quot; the word order in corresponding sentences.</Paragraph>
    <Paragraph position="1"> The presentation in this section focuses on the following aspects: question inversion and separated verb prefixes. For a more detailed discussion of restructuring for statistical machine translation the reader is referred to Niessen and Ney (2000, 2001).</Paragraph>
    <Section position="1" start_page="186" end_page="187" type="sub_section">
      <SectionTitle>
3.1 Question Inversion
</SectionTitle>
      <Paragraph position="0"> In many languages, the sentence structure of questions differs from the structure in declarative sentences in that the order of the subject and the corresponding finite verb is inverted. From the perspective of statistical translation, this behavior has some dis- null Niessen and Ney SMT with Scarce Resources advantages: The algorithm for training the parameters of the target language model</Paragraph>
      <Paragraph position="2"> ), which is typically a standard n-gram model, cannot deduce the probability of a word sequence in an interrogative sentence from the corresponding declarative form. The same reasoning is valid for the lexical translation probabilities of multiwordphrase pairs. To harmonize the word order of questions with the word order in declarative sentences, the order of the subject (including the appendant articles, adjectives etc.) and the corresponding finite verb is inverted. In English questions supporting dos are removed. The application of the described preprocessing step in the bilingual training corpus implies the necessity of restoring the correct forms of the translations produced by the machine translation algorithm. This procedure was suggested by Brown et al. (1992) for the language pair English and French, but they did not report on experimental results revealing the effect of the restructuring on the translation quality.</Paragraph>
    </Section>
    <Section position="2" start_page="187" end_page="187" type="sub_section">
      <SectionTitle>
3.2 Separated Verb Prefixes
</SectionTitle>
      <Paragraph position="0"> German prefix verbs consist of a main part and a detachable prefix, which can be shifted to the end of the clause. For the automatic alignment process, it is often difficult to associate one English word with more than one word in the corresponding German sentence, namely, the main part of the verb and the separated prefix. To solve the problem of separated prefixes, all separable word forms of verbs are extracted from the training corpus. The resulting list contains entries of the form prefix|main.</Paragraph>
      <Paragraph position="1"> In all clauses containing a word matching a main part and a word matching the corresponding prefix part occurring at the end of the clause, the prefix is prepended to the beginning of the main part.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="187" end_page="192" type="metho">
    <SectionTitle>
4. Hierarchical Lexicon Models
</SectionTitle>
    <Paragraph position="0"> In general, the probabilistic lexicon resulting from training the translation model contains all word forms occurring in the training corpus as separate entries, not taking into account whether or not they are inflected forms of the same lemma. Bearing in mind that typically more than 40% of the word forms are seen only once in training (see, for example, Table 5), it is obvious that for many words, learning the correct translations is difficult. Furthermore, new input sentences are expected to contain unknown word forms, for which no translation can be retrieved from the lexicon. This problem is especially relevant for more-inflecting languages like German: Texts in German contain many more distinct word forms than their English translations. Table 5 also reveals that these words are often generated via inflection from a smaller set of base forms.</Paragraph>
    <Section position="1" start_page="187" end_page="188" type="sub_section">
      <SectionTitle>
4.1 A Hierarchy of Equivalence Classes of Inflected Word Forms
</SectionTitle>
      <Paragraph position="0"> As mentioned in Section 2.3, the lemma-tag representation of the information from morpho-syntactic analysis makes it possible to gradually access information with different grades of abstraction. Consider, for example, the German verb form ankomme, which is the indicative present first-person singular form of the lemma ankommen and can be translated into English by arrive. The lemma-tag representation provides an  Computational Linguistics Volume 30, Number 2 * the base form (e.g., ankommen).</Paragraph>
      <Paragraph position="1"> In the following, t</Paragraph>
      <Paragraph position="3"> denotes the representation of a word where the base form t  and i additional tags are taken into account. For the example above, t  where n is the maximum number of morpho-syntactic tags. The mapping from the full lemma-tag representation back to inflected word forms is generally unambiguous; thus F n contains only one element, namely, ankomme. F n[?]1 contains the forms ankomme, ankommst, and ankommt;inF n[?]2 the number (singular or plural)isignored, and so on. The largest equivalence class contains all inflected forms of the base form ankommen.</Paragraph>
      <Paragraph position="4">  Section 4.2 introduces the concept of combining information at different levels of abstraction.</Paragraph>
    </Section>
    <Section position="2" start_page="188" end_page="192" type="sub_section">
      <SectionTitle>
4.2 Log-Linear Combination
</SectionTitle>
      <Paragraph position="0"> In modeling for statistical machine translation, a hidden variable a</Paragraph>
      <Paragraph position="2"> , denoting the hidden alignment between the words in the source and target languages, is usually introduced into the string translation probability:</Paragraph>
      <Paragraph position="4"> denotes the lemma-tag representation of the jth word in the input sentence. The sequence T  principle this decision can also be left to the maximum-entropy training, when features for all possible sets of tags are defined, but this would cause the number of parameters to explode. As the experiments in this work have been carried out only with up to three levels of abstraction as defined in Section 4.2, the set of tags of the intermediate level is fixed, and thus the priority of the tags needs not be specified. The relation between this equivalence class hierarchy and the suggestions in Section 2.4 is clear: Choosing candidates for morpho-syntactic tags not relevant for translation amounts to fixing a level in the hierarchy. This is exactly what has been done to define the intermediate level in Section 4.2.</Paragraph>
      <Paragraph position="6"> As has been argued in Section 2.2, the number of readings |T (f j ) |per word form can be reduced to one for the tasks for which experimental results are reported here. The elements in equation (4) are the joint probabilities p(f , T|e) of f and the readings T of f given the target language word e. The maximum-entropy principle recommends choosing for p the distribution which preserves as much uncertainty as possible in terms of maximizing the entropy, while requiring p to satisfy constraints which represent facts known from the data. These constraints are encoded on the basis of feature functions h m (x), and the expectation of each feature h m over the model p is required to be equal to the observed expectation. The maximum-entropy model can be shown to be unique and to have an exponential form involving a weighted sum over the feature functions h</Paragraph>
      <Paragraph position="8"> used again for the lemma-tag representation of an input word (this was denoted by T in equations (2)-(4) for notational simplicity):</Paragraph>
      <Paragraph position="10"> . These model parameters can be trained using converging iterative training procedures like the ones described by Darroch and Ratcliff (1972) or Della Pietra, Della Pietra, and Lafferty (1995).</Paragraph>
      <Paragraph position="11"> In the experiments presented in this article, the sum over the word forms</Paragraph>
      <Paragraph position="13"> in the denominator of equation (5) is restricted to the readings of word forms having the same base form and partial reading as a word form f primeprime aligned at least once to e.</Paragraph>
      <Paragraph position="14"> The new lexicon model p</Paragraph>
      <Paragraph position="16"> |e) can now replace the usual lexicon model p(f|e), over which it has the following main advantages: * The decomposition of the modeled events into feature functions allows meaningful probabilities to be provided for word forms that have not occurred during training as long as the feature functions involved are well-defined. (See also the argument later in the article and the definition of first-level and second-level feature functions presented in Section 4.2.1.) * Introducing the hidden variable T = t</Paragraph>
      <Paragraph position="18"> and constraining the lexicon probability to be zero for interpretations considered nonvalid readings of  into account by the morpho-syntactic analyzer, which chose the valid readings T (f).</Paragraph>
      <Paragraph position="19">  feature functions. We do not need to require that they all have the same parametric form or that the components be disjoint and statistically independent. Still, it is necessary to restrict the number of parameters so that optimizing them is practical. We used the following types of feature functions, which have been defined on the basis of the lemma-tag representation (see Section 2.3): First level: m = {L,~e}, where L is the base form:  In terms of the hierarchy introduced in Section 4.1, this means that information at three different levels in the hierarchy is combined. The subsets T of relevant tags mentioned previously fix the intermediate level.</Paragraph>
      <Paragraph position="20">  This choice of the types of features as well as the choice of the subsets T is reasonable but somewhat arbitrary. Alternatively one can think of defining a much more general set of features and applying some method of feature selection, as has been done, for example, by Foster (2000), who compared different methods for feature selection within the task of translation modeling for statistical machine translation. Note that the log-linear model introduced here uses one parameter per feature. For the Verbmobil task, for example, there are approximately 162, 000 parameters: 47,800 for the first-order features, 55,700 for the second-order features, and 58,500 for the third-order features. No feature selection or threshold was applied: All features seen in training were used.</Paragraph>
      <Paragraph position="21">  lexicon models is depicted in Figure 1. This figure includes the possibility of using restructuring operations as suggested in Section 3 in order to deal with structural differences between the languages involved. This can be especially advantageous in the  Training and test with hierarchical lexicon. &amp;quot;(Inverse) restructuring,&amp;quot; &amp;quot;analyze,&amp;quot; and &amp;quot;annotation&amp;quot; all require morpho-syntactic analysis of the transformed sentences. would raise the question of how to distribute the syntactic tags which have been associated with the whole phrase. In Section 5.2 we describe a method of learning multi-word phrases using conventional dictionaries. The alignment on the training corpus is trained using the original source language corpus containing inflected word forms. This alignment is then used to count the co-occurrences of the annotated &amp;quot;words&amp;quot; in the lemma-tag representation of the source language corpus with the words in the target language corpus. These event counts are used for the maximum-entropy training of the model parameters L.</Paragraph>
      <Paragraph position="22"> The probability mass is distributed over (all readings of) the source language word forms to be supported for test (not necessarily restricted to those occurring during training). The only precondition is that the firing features for these unseen events are known. This &amp;quot;vocabulary supported in test,&amp;quot; as it is called in Figure 1, can be a predefined closed vocabulary, as is the case in Verbmobil, in which the output of a speech recognizer with limited output vocabulary is to be translated. In the easiest case it is identical to the vocabulary found in the source language part of the training corpus. The other extreme would be an extended vocabulary containing all automatically generated inflected forms of all base forms occurring in the training corpus. This vocabulary is annotated with morpho-syntactic tags, ideally under consideration of all possible readings of all word forms.</Paragraph>
      <Paragraph position="23">  Computational Linguistics Volume 30, Number 2 To enable the application of the hierarchical lexicon model, the source language input sentences in test have to be analyzed and annotated with their lemma-tag representation before the actual translation process. So far, the sum over the readings in equation (4) has been ignored, because when the techniques for reducing the amount of ambiguity described in Section 2.2 and the disambiguated conventional dictionaries resulting from the approach presented in Section 5.1 are applied, there remains almost always only one reading per word form.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="192" end_page="196" type="metho">
    <SectionTitle>
5. Conventional Dictionaries
</SectionTitle>
    <Paragraph position="0"> Conventional dictionaries are often used as additional evidence to better train the model parameters in statistical machine translation. The expression conventional dictionary here denotes bilingual collections of word or phrase pairs predominantly collected &amp;quot;by hand,&amp;quot; usually by lexicographers, as opposed to the probabilistic lexica, which are learned automatically. Apart from the theoretical problem of how to incorporate external dictionaries in a mathematically sound way into a statistical framework for machine translation (Brown, Della Pietra, Della Pietra, and Goldsmith 1993) there are also some pragmatic difficulties: As discussed in Section 2.2, one of the disadvantages of these conventional dictionaries as compared to full bilingual corpora is that their entries typically contain single words or short phrases on each language side. Consequently, it is not possible to distinguish among the translations for different readings of a word. In normal bilingual corpora, the words can often be disambiguated by taking into account the sentence context in which they occur. For example, from the context in the sentence Ich werde die Zimmer buchen, it is possible to infer that Zimmer in this sentence is plural and has to be translated by rooms in English, whereas the correct translation of Zimmer in the sentence Ich h&amp;quot;atte gerne ein Zimmer is the singular form room. The dictionary used by our research group for augmenting the bilingual data contains two entries for Zimmer: ('Zimmer'|'room') and ('Zimmer'|'rooms').</Paragraph>
    <Section position="1" start_page="192" end_page="194" type="sub_section">
      <SectionTitle>
5.1 Disambiguation without Context
</SectionTitle>
      <Paragraph position="0"> The approach described in this section is based on the observation that in many of the cases of ambiguous entries in dictionaries, the second part of the entry--that is, the other-language side--contains the information necessary to decide upon the interpretation. In some other cases, the same kind of ambiguity is present in both languages, and it would be possible and desirable to associate the (semantically) corresponding readings with one another. The method proposed here takes advantage of these facts in order to disambiguate dictionary entries.</Paragraph>
      <Paragraph position="1"> Figure 2 sketches the procedure for the disambiguation of a conventional dictionary D. In addition to D, a bilingual corpus C  of the same language pair is required to train the probability model for tag sequence translations. The word forms in C  need not match those in D. C  is not necessarily the training corpus for the translation task in which the disambiguated version of D will be used. It does not even have to be taken from the same domain.</Paragraph>
      <Paragraph position="2"> A word alignment between the sentences in C  is trained with some automatic alignment algorithm. Then the words in the bilingual corpus are replaced by a reduced form of their lemma-tag representation, in which only a subset of their morpho-syntactic tags is retained--even the base form is dropped. The remaining subset of tags, in the following denoted by T f for the source language and T e for the target language, consists of tags considered relevant for the task of aligning corresponding readings. This is not necessarily the same set of tags considered relevant for the task of translation which was used, for example, to fix the intermediate level for the log-linear lexicon  Disambiguation of conventional dictionaries. &amp;quot;Learn phrases,&amp;quot; &amp;quot;analyze,&amp;quot; and &amp;quot;annotation&amp;quot; require morpho-syntactic analysis of the transformed sentences. combination in Section 4.2.1. In the case of the Verbmobil corpus, the maximum length of a tag sequence is five.</Paragraph>
      <Paragraph position="3"> The alignment is used to count the frequency of a certain tag sequence t f in the source language to be associated with another tag sequence t e in the target language and to compute the tag sequence translation probabilities p(t</Paragraph>
      <Paragraph position="5"> ) as relative frequencies. For the time being, these tag sequence translation probabilities associate readings of words in one language with readings of words in the other language: Multiword sequences are not accounted for.</Paragraph>
      <Paragraph position="6"> To alleviate this shortcoming it is possible and advisable to automatically detect and merge multiword phrases. As will be described in Section 5.2, the conventional bilingual dictionary itself can be used to learn and validate these phrases. The resulting multiword phrases P e for the target language and P f for the source language are afterwards concatenated within D to form entries consisting of pairs of &amp;quot;units.&amp;quot; The next step is to analyze the word forms in D and generate all possible readings of all entries. It is also possible to ignore those readings that are considered unlikely for the task under consideration by applying the domain-specific preference rules proposed in Section 2.2. The process of generating all readings includes replacing word forms with their lemma-tag representation, which is thereafter reduced by dropping all morpho-syntactic tags not contained in the tag sets T</Paragraph>
      <Paragraph position="8"> ), the readings in one language are aligned with readings in the other language. These alignments are applied to the full lemma-tag representation (not only tags in T</Paragraph>
      <Paragraph position="10"> ) of the expanded dictionary containing one entry per reading of the original word forms. The highest-ranking aligned readings according to p(t</Paragraph>
      <Paragraph position="12"> ) for each lemma are preserved.</Paragraph>
      <Paragraph position="13">  Computational Linguistics Volume 30, Number 2 The resulting disambiguated dictionary contains two entries for the German word Zimmer:('Zimmer-noun-sg.'|'room-noun-sg.') and ('Zimmer-noun-pl.'|'room-nounpl.'). The target language part is then reduced to the surface forms: ('Zimmer-noun-sg.'| 'room') and ('Zimmer-noun-pl.'|'rooms'). Note that this augmented dictionary, in the following denoted by D prime , has more entries than D as a result of the step of generating all readings. The two entries ('beabsichtigt'|'intends') and ('beabsichtigt'|'intended'), for example, produce three new entries: ('beabsichtigt-verb-ind.-pres.-sg.3rd'|'intends'), ('beabsichtigt-verb-past-part.'|'intended'), and ('beabsichtigtadjective-pos.'|'intended'). null</Paragraph>
    </Section>
    <Section position="2" start_page="194" end_page="196" type="sub_section">
      <SectionTitle>
5.2 Multiword Phrases
</SectionTitle>
      <Paragraph position="0"> Some recent publications deal with the automatic detection of multiword phrases (Och and Weber 1998; Tillmann and Ney 2000). These methods are very useful, but they have one drawback: They rely on sufficiently large training corpora, because they detect the phrases from automatically learned word alignments. In this section a method for detecting multiword phrases is suggested which merely requires monolingual syntactic analyzers and a conventional dictionary.</Paragraph>
      <Paragraph position="1"> Some multiword phrases which jointly fulfill a syntactic function are provided by the analyzers. The phrase irgend etwas ('anything'), for example, may form either an indefinite determiner or an indefinite pronoun. irgend=etwas is merged by the analyzer in order to form one single vocabulary entry. In the German part of the Verbmobil training corpus 26 different, nonidiomatic multiword phrases are merged, while there are 318 phrases suggested for the English part. In addition, syntactic information like the identification of infinitive markers, determiners, modifying adjectives (for example, single room), premodifying adverbials (more comfortable), and premodifying nouns (account number) are used for detecting multiword phrases. When applied to the English part of the Verbmobil training corpus, these hints suggest 7,225 different phrases.</Paragraph>
      <Paragraph position="2"> Altogether, 26 phrases for German and about 7,500 phrases for English are detected in this way. It is quite natural that there are more multiword phrases found for English, as German, unlike English, uses compounding. But the experiments show that it is not advantageous to use all these phrases for English. Electronic dictionaries can be useful for detecting those phrases which are important in a statistical machine translation context: A multiword phrase is considered useful if it is translated into a single word or a distinct multiword phrase (suggested in a similar way by syntactic analysis) in another language. There are 290 phrases chosen in this way for the English language.</Paragraph>
      <Paragraph position="3"> 6. Overall Procedure for Training with Scarce Resources Taking into account the interdependencies of inflected forms of the same base form is especially relevant when inflected languages like German are involved and when training data are sparse. In this situation many of the inflected word forms to account for in test do not occur during training. Sparse bilingual training data also make additional conventional dictionaries especially important. Enriching the dictionaries by aligning corresponding readings is particularly useful when the dictionaries are used in conjunction with a hierarchical lexicon, which can access the information necessary to distinguish readings via morpho-syntactic tags. The restructuring operations described in Section 3 also help in coping with the data sparseness problem, because they make corresponding sentences more similar. This section proposes a procedure for combining all these methods in order to improve the translation quality despite sparseness of data. Figure 3 sketches the proposed procedure.</Paragraph>
      <Paragraph position="4">  Training with scarce resources. &amp;quot;Restructuring,&amp;quot; &amp;quot;learn phrases,&amp;quot; and &amp;quot;annotation&amp;quot; all require morpho-syntactic analysis of the transformed sentences.</Paragraph>
      <Paragraph position="5"> Two different bilingual corpora C  can, but need not, be distinct, and that the monolingual corpus can be identical to the target language part of  has to represent the domain and the vocabulary for which the translation system is built, and only the size of C  and the monolingual corpus have a substantial effect on the translation quality. It is interesting to note, though, that a basic statistical machine translation system with an accuracy near 50% can be built without any domain-specific bilingual corpus C  , solely on the basis of a disambiguated dictionary and the hierarchical lexicon models, as Table 9 shows.</Paragraph>
      <Paragraph position="6"> * In the first step, multiword phrases are learned and validated on the dictionary D in the way described in Section 5.2. These multiword phrases are concatenated in D. Then an alignment is trained on the first bilingual corpus C  . On the basis of this alignment, the tag sequence translation probabilities which are needed to align corresponding readings in the dictionary are extracted, as proposed in Section 5.1. The result of this step is an expanded and disambiguated dictionary D  Computational Linguistics Volume 30, Number 2 can be comparatively small, given the limited number of tag sequence</Paragraph>
      <Paragraph position="8"> ) for which translation probabilities must be provided: In the Verbmobil training corpus, for example, there are only 261 different German and 110 different English tag sequences.</Paragraph>
      <Paragraph position="9">  as input to the maximum-entropy training of a hierarchical lexicon model as described in Section 4.2.</Paragraph>
      <Paragraph position="10"> * The language model can be trained on a separate monolingual corpus. As monolingual data are much easier and cheaper to compile, this corpus might be (substantially) larger than the target language part of C  .</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="196" end_page="199" type="metho">
    <SectionTitle>
7. Experimental Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="196" end_page="196" type="sub_section">
      <SectionTitle>
7.1 The Tasks and the Corpora
</SectionTitle>
      <Paragraph position="0"> Tests were carried out on Verbmobil data and on Nespole! data. As usual, the sentences from the test sets were not used for training. The training corpora were used for training the parameters of IBM model 4.</Paragraph>
      <Paragraph position="1"> 7.1.1 Verbmobil. Verbmobil was a project for automatic translation of spontaneously spoken dialogues. A detailed description of the statistical translation system within Verbmobil is given by Ney et al. (2000) and by Och (2002). Table 5 summarizes the characteristics of the English and German parallel corpus used for training the parameters of IBM model 4. A conventional dictionary complements the training corpus (see Table 6 for the statistics). The vocabulary in Verbmobil was considered closed: There are official lists of word forms which can be produced by the speech recognizers. Such lists exist for German and English (see Table 7). Table 8 lists the characteristics of the two test sets Test and Develop taken from the end-to-end evaluation in Verbmobil, the development part being meant to tune system parameters on a held-out corpus different from the training as well as the test corpus. As no parameters are optimized on the development set for the methods described in this article, most of the experiments were carried out on a joint set containing both test sets.</Paragraph>
      <Paragraph position="2"> Table 5 Statistics of corpora for training: Verbmobil and Nespole! Singletons are types occurring only once in training.</Paragraph>
      <Paragraph position="3">  7.1.2 Nespole!. Nespole! is a research project that ran from January 2000 to June 2002. It aimed to provide multimodel support for negotiation (Nespole! 2000; Lavie et al. 2001). Table 5 summarizes the corpus statistics of the Nespole! training set. Table 8 provides the corresponding figures for the test set used in this work.</Paragraph>
    </Section>
    <Section position="2" start_page="196" end_page="196" type="sub_section">
      <SectionTitle>
7.2 The Translation System
</SectionTitle>
      <Paragraph position="0"> For testing we used the alignment template translation system, described in Och, Tillmann, and Ney (1999). Training the parameters for this system entails training of IBM model 4 parameters in both translation directions and combining the resulting alignments into one symmetrized alignment. From this symmetrized alignment, the lexicon probabilities as well as the so-called alignment templates are extracted. The latter are translation patterns which capture phrase-level translation pairs.</Paragraph>
    </Section>
    <Section position="3" start_page="196" end_page="198" type="sub_section">
      <SectionTitle>
7.3 Performance Measures
</SectionTitle>
      <Paragraph position="0"> The following evaluation criteria were used in the experiments: BLEU (Bilingual Evaluation Understudy): This score, proposed by Papineni et al. (2001), is based on the notion of modified n-gram precision, with n [?]{1, ...,4}: All candidate unigram, bigram, trigram, and four-gram counts are collected and clipped against their corresponding maximum reference counts. The reference n-gram counts are calculated on a corpus  Computational Linguistics Volume 30, Number 2 of reference translations for each input sentence. The clipped candidate counts are summed and normalized by the total number of candidate ngrams. The geometric mean of the modified precision scores for a test corpus is calculated and multiplied by an exponential brevity penalty factor to penalize too-short translations. BLEU is an accuracy measure, while the others are error measures.</Paragraph>
      <Paragraph position="1"> m-WER (multireference word error rate): For each test sentence there is a set of reference translations. For each translation hypothesis, the edit distance (number of substitutions, deletions, and insertions) to the most similar reference is calculated.</Paragraph>
      <Paragraph position="2"> SSER (subjective sentence error rate): Each translated sentence is judged by a human examiner according to an error scale from 0.0 (semantically and syntactically correct) to 1.0 (completely wrong).</Paragraph>
      <Paragraph position="3"> ISER (information item semantic error rate): The test sentences are segmented into information items; for each of these items, the translation candidates are assigned either &amp;quot;OK&amp;quot; or an error class. If the intended information is conveyed, the translation of an information item is considered correct, even if there are slight syntactic errors which do not seriously deteriorate the intelligibility.</Paragraph>
      <Paragraph position="4"> For evaluating the SSER and the ISER, we have used the evaluation tool EvalTrans (Niessen and Leusch 2000), which is designed to facilitate the work of manually judging evaluation quality and to ensure consistency over time and across evaluators.</Paragraph>
    </Section>
    <Section position="4" start_page="198" end_page="199" type="sub_section">
      <SectionTitle>
7.4 Impact of the Corpus Size
</SectionTitle>
      <Paragraph position="0"> It is a costly and time-consuming task to compile large texts and have them translated to form bilingual corpora suitable for training the model parameters for statistical machine translation. As a consequence, it is important to investigate the amount of data necessary to sufficiently cover the vocabulary expected in testing. Furthermore, we want to examine to what extent the incorporation of morphological knowledge sources can reduce this amount of necessary data. Figure 4 shows the relation between the size of a typical German corpus and the corresponding number of different full forms. At the size of 520,000 words, the size of the Verbmobil corpus used for training, this curve still has a high growth rate.</Paragraph>
      <Paragraph position="1"> To investigate the impact of the size of the bilingual corpus available for training, on translation quality three different setups for training the statistical lexicon on Verbmobil data have been defined: * using the full training corpus as described in Table 5, comprising 58,000 sentences * restricting the corpus to 5,000 sentences (approximately every 11th sentence) * using no bilingual training corpus at all (only a bilingual dictionary; see subsequent discussion) The language model is always trained on the full English corpus. The argument for this is that monolingual corpora are always easier and less expensive to obtain than bilingual corpora. A conventional dictionary is used in all three setups to complement  Impact of corpus size (measured in number of running words in the corpus) on vocabulary size (measured in number of different full-form words found in the corpus) for the German part of the Verbmobil corpus.</Paragraph>
      <Paragraph position="2"> the bilingual corpus. In the last setup, the lexicon probabilities are trained exclusively on this dictionary As Table 9 shows, the quality of translation drops significantly when the amount of bilingual data available during training is reduced: When the training corpus is restricted to 5,000 sentences, the SSER increases by about 7% and the ISER by about 3%. As could be expected, the translations produced by the system trained exclusively on a conventional dictionary are very poor: The SSER jumps over 60%.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML