
<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1062">
  <Title>The RWTH System for Statistical Translation of Spoken Dialogues</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. STATISTICAL DECISION THEORY
AND LINGUISTICS
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 The Statistical Approach
</SectionTitle>
      <Paragraph position="0"> The use of statistics in computational linguistics has been extremely controversial for more than three decades. The controversy is very well summarized by the statement of Chomsky in 1969 [6]: &amp;quot;It must be recognized that the notion of a 'probability of a sentence' is an entirely useless one, under any interpretation of this term&amp;quot;.</Paragraph>
      <Paragraph position="1"> This statement was considered to be true by the majority of experts from artificial intelligence and computational linguistics, and the concept of statistics was banned from computational linguistics for many years.</Paragraph>
      <Paragraph position="2"> What is overlooked in this statement is the fact that, in an automatic system for speech recognition or text translation, we are faced with the problem of taking decisions. It is exactly here where statistical decision theory comes in. In speech recognition, the success of the statistical approach is based on the equation:  Similarly, for machine translation, the statistical approach is expressed by the equation:</Paragraph>
      <Paragraph position="4"> For the 'low-level' description of speech and image signals, it is widely accepted that the statistical framework allows an efficient coupling between the observations and the models, which is often described by the buzz word 'subsymbolic processing'. But there is another advantage in using probability distributions in that they offer an explicit formalism for expressing and combining hypothesis scores: + The probabilities are directly used as scores: These scores are normalized, which is a desirable property: when increasing the score for a certain element in the set of all hypotheses, there must be one or several other elements whose scores are reduced at the same time.</Paragraph>
      <Paragraph position="5"> + It is straightforward to combine scores: depending on the task, the probabilities are either multiplied or added.</Paragraph>
      <Paragraph position="6"> + Weak and vague dependencies can be modelled easily. Especially in spoken and written natural language, there are nuances and shades that require 'grey levels' between 0 and 1.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Bayes Decision Rule and
System Architecture
</SectionTitle>
      <Paragraph position="0"> In machine translation, the goal is the translation of a text given in a source language into a target language. We are given a source string fJ1 = f1:::fj:::fJ, which is to be translated into a target string eI1 = e1:::ei:::eI. In this article, the term word always refers to a full-form word. Among all possible target strings, we will choose the string with the highest probability which is given by Bayes decision rule [5]:</Paragraph>
      <Paragraph position="2"> Here, Pr(eI1) is the language model of the target language, and Pr(fJ1 jeI1) is the string translation model which will be decomposedintolexiconandalignmentmodels. Theargmax operation denotes the search problem, i.e. the generation of the output sentence in the target language. The overall architecture of the statistical translation approach is summarized in Figure 1.</Paragraph>
      <Paragraph position="3"> In general, as shown in this figure, there may be additional transformations to make the translation task simpler for the algorithm. The transformations may range from the categorization of single words and word groups to more complex preprocessing steps that require some parsing of the source string. We have to keep in mind that in the search procedure both the language and the translation model are applied after the text transformation steps. However, to keep the notation simple, we will not make this explicit distinction in the subsequent exposition.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. ALIGNMENT MODELLING
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Concept
</SectionTitle>
      <Paragraph position="0"> A key issue in modelling the string translation probability Pr(fJ1 jeI1) is the question of how we define the correspondence between the words of the target sentence and the words of the source sentence. In typical cases, we can assume a sort of pairwise dependence by considering all word pairs (fj;ei) for a given sentence pair (fJ1 ;eI1). Here, we will further constrain this model by assigning each source word to exactly one target word. Later, this requirement will be relaxed. Models describing these types of dependencies are referred to as alignment models [5, 24].</Paragraph>
      <Paragraph position="1"> When aligning the words in parallel texts, we typically observe a strong localization effect. Figure 2 illustrates this effect for the language pair German-English. In many cases, although not always, there is an additional property: over large portions of the source string, the alignment is monotone. null  based on Bayes decision rule.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Basic Models
</SectionTitle>
      <Paragraph position="0"> To arrive at a quantitative specification, we define the alignment mapping: j ! i = aj; which assigns a word fj in position j to a word ei in position i = aj. We rewrite the probability for the translation model by introducing the 'hidden' alignments aJ1 := a1:::aj:::aJ for each sentence pair (fJ1 ;eI1). To structure this probability distribution, we factorize it over the positions in the source sentence and limit the alignment dependencies to a first-order dependence:</Paragraph>
      <Paragraph position="2"> Here, we have the following probability distributions: + the sentence length probability: p(JjI), which is included here for completeness, but can be omitted without loss of performance; + the lexicon probability: p(fje); + the alignment probability: p(ajjaj!1;I;J).</Paragraph>
      <Paragraph position="3"> By making the alignment probability p(ajjaj!1;I;J) dependent on the jump width aj ! aj!1 instead of the absolute positions aj, we obtain the so-called homogeneous hidden Markov model, for short HMM [24].</Paragraph>
      <Paragraph position="4"> We can also use a zero-order model p(ajjj;I;J), where there is only a dependence on the absolute position index j of the source string. This is the so-called model IBM-2 [5]. Assuming a uniform alignment probability p(ajjj;I;J) = 1=I, we arrive at the so-called model IBM-1.</Paragraph>
      <Paragraph position="5"> These models can be extended to allow for source words having no counterpart in the translation. Formally, this is incorporated into the alignment models by adding a so-called 'empty word' at position i = 0 to the target sentence and aligning all source words without a direct translation to this empty word.</Paragraph>
      <Paragraph position="6">  In [5], more refined alignment models are introduced by using the concept of fertility. The idea is that often a word in the target language may be aligned to several words in the source language. This is the so-called model IBM-3. Using, in addition, first-order alignment probabilities along the positions of the source string leads us to model IBM-4. Although these models take one-to-many alignments explicitly into account, the lexicon probabilities p(fje) are still based on single words in each of the two languages.</Paragraph>
      <Paragraph position="7"> In systematic experiments, it was found that the quality of the alignments determined from the bilingual training corpus has a direct effect on the translation quality [14].</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Alignment Template Approach
</SectionTitle>
      <Paragraph position="0"> A general shortcoming of the baseline alignment models is that they are mainly designed to model the lexicon dependences between single words. Therefore, we extend the approach to handle word groups or phrases rather than single words as the basis for the alignment models [15]. In other words, a whole group of adjacent words in the source sentence may be aligned with a whole group of adjacent words in the target language. As a result, the context of words tends to be explicitly taken into account, and the differences in local word orders between source and target languages can be learned explicitly. Figure 3 shows some of the extracted alignment templates for a sentence pair from the Verbmobil training corpus. The training algorithm for the alignment templates extracts all phrase pairs which are aligned in the training corpus up to a maximum length of 7 words. To improve the generalization capability of the alignment templates, the templates are determined for bilingual word classes rather than words directly. These word classes are determined by an automatic clustering procedure [13].</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4. SEARCH
</SectionTitle>
    <Paragraph position="0"> The task of the search algorithm is to generate the most likely target sentence eI1 of unknown length I for an observed source sentence fJ1 . The search must make use of all three knowledge sources as illustrated by Figure 4: the alignment model, the lexicon model and the language model. All three  ofthemmustcontributeinthefinaldecision aboutthewords in the target language.</Paragraph>
    <Paragraph position="1"> To illustrate the specific details of the search problem, we slightly change the definitions of the alignments: + we use inverted alignments as in the model IBM-4 [5] which define a mapping from target to source positions rather the other way round.</Paragraph>
    <Paragraph position="2"> + we allow several positions in the source language to be covered, i.e. we consider mappings B of the form: B : i ! Bi %0f1;:::j;:::Jg We replace the sum over all alignments by the best alignment, which is referred to as maximum approximation in speech recognition. Using a trigram language model p(eij;ei!2;ei!1), we obtain the following search criterion:</Paragraph>
    <Paragraph position="4"> Considering this criterion, we can see that we can build up hypotheses of partial target sentences in a bottom-to-top strategy over the positions i of the target sentence ei1 as illustrated in Figure 5. An important constraint for the alignment is that all positions of the source sentence should be covered exactly once. This constraint is similar to that of the travelling salesman problem where each city has to be visited exactly once. Details on various search strategies can be found in [4, 9, 12, 21].</Paragraph>
    <Paragraph position="5"> In order to take long context dependences into account, we use a class-based five-gram language model with backingoff. Beam-search is used to handle the huge search space. To normalize the costs of partial hypotheses covering different parts of the input sentence, an (optimistic) estimation of the remaining cost is added to the current accumulated cost as follows. For each word in the source sentence, a lower bound on its translation cost is determined beforehand. Using this</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
SENTENCE INSOURCE LANGUAGE
TRANSFORMATION
SENTENCE GENERATEDIN TARGET LANGUAGE
SENTENCE
KNOWLEDGE SOURCESSEARCH: INTERACTION OF KNOWLEDGE SOURCES
WORD + POSITION
ALIGNMENT
LANGUAGE MODEL
BILINGUAL LEXICON
ALIGNMENTMODELWORD RE-ORDERING
SYNTACTIC ANDSEMANTIC ANALYSIS
LEXICAL CHOICE
HYPOTHESES
HYPOTHESES
HYPOTHESES
TRANSFORMATION
</SectionTitle>
    <Paragraph position="0"> lower bound, it is possible to achieve an efficient estimation of the remaining cost.</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5. EXPERIMENTAL RESULTS
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 The Task and the Corpus
</SectionTitle>
      <Paragraph position="0"> Within the Verbmobil project, spoken dialogues were recorded. These dialogues were manually transcribed and later manually translated by Verbmobil partners (Hildesheim for Phase I and T&amp;quot;ubingen for Phase II). Since different human translators were involved, there is great variability in the translations.</Paragraph>
      <Paragraph position="1"> Each of these so-called dialogues turns may consist of several sentences spoken by the same speaker and is sometimes</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
SOURCE POSITION
TARGET POSITION
</SectionTitle>
    <Paragraph position="0"/>
    <Paragraph position="2"> rather long. As a result, there is no one-to-one correspondence between source and target sentences. To achieve a one-to-one correspondence, the dialogue turns are split into shorter segments using punctuation marks as potential split points. Since the punctuation marks in source and target sentences are not necessarily identical, a dynamic programming approach is used to find the optimal segmentation points. The number of segments in the source sentence and in the test sentence can be different. The segmentation is scored using a word-based alignment model, and the segmentation with the best score is selected. This segmented corpus is the starting point for the training of translation and language models. Alignment models of increasing complexity are trained on this bilingual corpus [14].</Paragraph>
    <Paragraph position="3"> A standard vocabulary had been defined for the various speech recognizers used in Verbmobil. However, not all words of this vocabulary were observed in the training corpus. Therefore, the translation vocabulary was extended semi-automatically by adding about 13000 German-English word pairs from an online bilingual lexicon available on the web. The resulting lexicon contained not only word-word entries, but also multi-word translations, especially for the large number of German compound words. To counteract the sparseness of the training data, a couple of straightforward rule-based preprocessing steps were applied before any other type of processing: + categorization of proper names for persons and cities,  + normalization of: - numbers, - time and date phrases, - spelling: don't ! do not,...</Paragraph>
    <Paragraph position="4"> + splitting of  German compound words.</Paragraph>
    <Paragraph position="5"> Table 1 gives the characteristics of the training corpus and the lexicon. The 58000 sentence pairs comprise about half a million running words for each language of the bilingual training corpus. The vocabulary size is the number of distinct full-form words seen in the training corpus. Punctuation marks are treated as regular words in the translation approach. Notice the large number of word singletons, i. e. words seen only once. The extended vocabulary is the vocabulary after adding the manual bilingual lexicon.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Offline Results
</SectionTitle>
      <Paragraph position="0"> During the progress of the Verbmobil project, different variants of statistical translation were implemented, and ex- null perimental tests were performed for both text and speech input. To summarize these experimental tests, we briefly report experimental offline results for the following translation approaches: + single-word based approach [20]; + alignment template approach [15]; + cascaded transducer approach [23]: unlike the other two-approaches, this approach requires a semi-automatic training procedure, in which the structure of the finite state transducers is designed manually. For more details, see [23].</Paragraph>
      <Paragraph position="1"> The offline tests were performed on text input for the translation direction from German to English. The test set consisted of 251 sentences, which comprised 2197 words and 430 punctuation marks. The results are shown in Table 2. To judge and compare the quality of different translation approaches in offline tests, we typically use the following error measures [11]: + mWER (multi-reference word error rate): For each test sentence sk in the source language, there are several reference translationsRk = frk1;::: ;rknkg in the target language. For each translation of the test sentence sk, the edit distances (number of substitutions, deletions and insertions as in speech recognition) to all sentences in Rk are calculated, and the smallest distance is selected and used as error measure.</Paragraph>
      <Paragraph position="2"> + SSER (subjective sentence error rate): Each translated sentence is judged by a human examiner according to an error scale from 0.0 (semantically and syntactically correct) to 1.0 (completely wrong).</Paragraph>
      <Paragraph position="3"> Both error measures are reported in Table 2. Although the experiments with the cascaded transducers [23] were not fully optimized yet, the preliminary results indicated that this semi-automatic approach does not generalize as well as the other two fully automatic approaches. Among these two, the alignment template approach was found to work consistently better across different test sets (and also tasks different from Verbmobil). Therefore, the alignment template approach was used in the final Verbmobil prototype system.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Disambiguation Examples
</SectionTitle>
      <Paragraph position="0"> In the statistical translation approach as we have presented it, no explicit word sense disambiguation is performed. However, a kind of implicit disambiguation is possible due to the context information of the alignment templates and the language model as shown by the examples in Table 3. The first two groups of sentences contain the  verbs 'gehen' and 'annehmen' which have different translations, some of which are rather collocational. The correct translation is only possible by taking the whole sentence into account. Some improvement can be achieved by applying morpho-syntactic analysis, e.g handling of the separated verb prefixes in German [10]. The last two sentences show the implicit disambiguation of the temporal and spatial sense for the German preposition 'vor'. Although the system has not been tailored to handle such types of disambiguation, the translated sentences are all acceptable, apart from the sentence: The meeting is to five.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.4 Integration into the Verbmobil Prototype
System
</SectionTitle>
      <Paragraph position="0"> The statistical approach to machine translation is embodied in the stattrans module which is integrated into the Verbmobil prototype system. We briefly review those aspects of it that are relevant for the statistical translation approach. The implementation supports the translation directions from German to English and from English to German.</Paragraph>
      <Paragraph position="1"> In regular processing mode, the stattrans module receives its input from the repair module [18]. At that time, the word lattices and best hypotheses from the speech recognition systems have already been prosodically annotated, i.e. information about prosodic segment boundaries, sentence mode and accentuated syllables are added to each edge in the word lattice [2]. The translation is performed on the single best sentence hypothesis of the recognizer.</Paragraph>
      <Paragraph position="2"> The prosodic boundaries and the sentence mode information are utilized by the stattrans module as follows. If there is a major phrase boundary, a full stop or question mark is inserted into the word sequence, depending on the sentence mode as indicated by the prosody module. Additional commas are inserted for other types of segment boundaries. The prosody module calculates probabilities for segment boundaries, and thresholds are used to decide if the sentence marks are to be inserted. These thresholds have been selected in such a way that, on the average, for each dialogue turn, a good segmentation is obtained. The segment boundaries restrict possible word reordering between source and target language. This not only improves translation quality, but also restricts the search space and thereby speeds up the translation process.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.5 Large-Scale End-to-End Evaluation
</SectionTitle>
      <Paragraph position="0"> Whereas the offline tests reported above were important for the optimization and tuning of the system, the most important evaluation was the final evaluation of the Verbmobil prototype in spring 2000. This end-to-end evaluation of the Verbmobil system was performed at the University of Hamburg [19]. In each session of this evaluation, two native speakers conducted a dialogue. They did not have any direct contact and could only interact by speaking and listening to the Verbmobil system.</Paragraph>
      <Paragraph position="1"> Three other translation approaches had been integrated into the Verbmobil prototype system: + a classical transfer approach [3, 7, 22], which is based on a manually designed analysis grammar, a set of transfer rules, and a generation grammar, + a dialogue act based approach [16], which amounts to a sort of slot filling by classifying</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Ambiguous Word Text Input Translation
</SectionTitle>
      <Paragraph position="0"> gehen Wir gehen ins Theater. We will go to the theater.</Paragraph>
      <Paragraph position="1"> Mir geht es gut. I am fine.</Paragraph>
      <Paragraph position="2"> Es geht um Geld. It is about money.</Paragraph>
      <Paragraph position="3"> Geht es bei Ihnen am Montag? Is it possible for you on Monday? Das Treffen geht bis 5 Uhr. The meeting is to five.</Paragraph>
      <Paragraph position="4"> annehmen Wir sollten das Angebot annehmen. We should accept that offer. Ich nehme das Schlimmste an. I will assume the worst./ vor Wir treffen uns vor dem Fr&amp;quot;uhst&amp;quot;uck. We meet before the breakfast. Wir treffen uns vor dem Hotel. We will meet in front of the hotel. each sentence into one out of a small number of possible sentence patterns and filling in the slot values, + an example-based approach [1], where a sort of nearest neighbour concept is applied to the set of bilingual training sentence pairs after suitable preprocessing.</Paragraph>
      <Paragraph position="5"> In the final end-to-end evaluation, human evaluators judged the translation quality for each of the four translation results using the following criterion: Is the sentence approximatively correct: yes/no? The evaluators were asked to pay particular attention to the semantic information (e.g. date and place of meeting, participants etc) contained in the translation. A missing translation as it may happen for the transfer approach or other approaches was counted as wrong translation. The evaluation was based on 5069 dialogue turns for the translation from German to English and on 4136 dialogue turns for the translation from English to German. The speech recognizers used had a word error rate of about 25%. The overall sentence error rates, i.e. resulting from recognition and translation, are summarized in Table 4. As we can see, the error rates for the statistical approach are smaller by a factor of about 2 in comparison with the other approaches.</Paragraph>
      <Paragraph position="6"> In agreement with other evaluation experiments, these experiments show that the statistical modelling approach may be comparable to or better than the conventional rule-based approach. In particular, the statistical approach seems to have the advantage if robustness is important, e.g. when the input string is not grammatically correct or when it is corrupted by recognition errors.</Paragraph>
      <Paragraph position="7"> Although both text and speech input are translated with good quality on the average by the statistical approach,  there are examples where the syntactic structure of the produced sentence is not correct. Some of these syntactic errors are related to long range dependencies and syntactic structures that are not captured by the m-gram language model used. To cope with these problems, morpho-syntactic analysis [10] and grammar-based language models [17] are currently being studied.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>