<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-1405">
  <Title>Stochastic Modelling: From Pattern Classification to Language Translation</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
ASR = Acoustic-Linguistic Modelling
+ Statistical Decision Theory
</SectionTitle>
    <Paragraph position="0"> Similarly, for machine translation (MT), the statistical approach is expressed by the equation:</Paragraph>
    <Paragraph position="2"> For the 'low-level' description of speech and image signals, it is widely accepted that the stochastic framework allows an efficient coupling between the observations and the models, which is often described by the buzz word 'subsymbolic processing'. But there is another advantage in using probability distributions in that they offer an explicit formalism for expressing and combining hypothesis scores: + The probabilities are directly used as scores: These scores are normalized, which is a desirable property: when increasing the score for a certain element in the set of all hypotheses, there must be one or several other elements whose scores are reduced at the same time.</Paragraph>
    <Paragraph position="3"> + It is evident how to combine scores: depending on the task, the probabilities are either multiplied or added.</Paragraph>
    <Paragraph position="4"> + Weak and vague dependencies can be modelled easily. Especially in spoken and written natural language, there are nuances and shades that require 'grey levels' between 0 and 1.</Paragraph>
    <Paragraph position="5"> Even if we think we can manage without statistics, we will need models which always have some free parameters. Then the question is how to train these free parameters. The obvious approach is to adjust these parameters in such a way that we get optimal results in terms of error rates or similar criteria on a representative sample. So we have made a complete cycle and have reached the starting point of the stochastic modelling approach again! When building an automatic system for speech or language, we should try to use as much prior knowledge as possible about the task under consideration. This knowledge is used to guide the modelling process and to enable improved generalization with respect to unseen data. Therefore in a good stochastic modelling approach, we try to identify the common patterns underlying the observations, i.e. to capture dependencies between the data in order to avoid the pure 'black box' concept.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Language Translation as Pattern Classification
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Bayes Decision Rule
</SectionTitle>
      <Paragraph position="0"> Knowing that language translation is a difficult task, we want to keep the number of wrong translations as small as possible. The corresponding formalism is provided by the so-called statistical decision theory. The resulting decison rule is referred to as Bayes decision rule and is the starting point for many techniques in pattern classification (Duda et al. 2001). To classify an observation vector y into one out of several classes c, the Bayes decision rule is:</Paragraph>
      <Paragraph position="2"> For language translation, the starting point is the observed sequence of source symbols y = fJ1 = f1:::fJ, i.e. the sequence of source words, for which the target word sequence c = eI1 = e1:::eI has to be determined. In order to minimize the number of decision errors at the sentence level, we have to choose the sequence of target words ^eI1 according to the equation (Brown et al. 1993):</Paragraph>
      <Paragraph position="4"> Here, the posterior probability Pr(eI1jfJ1 ) is decomposed into the language model probability Pr(eJ1) and the string translation probability Pr(fJ1 jeI1). Due to this factorization, we have two separate probability distributions which can be modelled and trained independently of each other.</Paragraph>
      <Paragraph position="5"> Fig.1 shows the architecture that results from the Bayes decision theory. Here we have already taken into account that, in order to implement the string translation model, we will decompose it into a so-called alignment model and a lexicon model. As also shown in this figure, we explicitly allow for optional transformations to make the translation task simpler for the algorithm.</Paragraph>
      <Paragraph position="6"> In total, we have the following crucial constituents of the stochastic modelling approach to language translation: + There are two separate probability distributions or stochastic knowledge sources: - the language model distribution Pr(eI1), which is assigned to each possible target word sequence eI1 and which ultimately captures all syntactic, semantic and pragmatic constraints of the target language domain under consideration; - the string translation probability distribution Pr(fJ1 jeI1) which assigns a score as to how well the source string fJ1 matches the hypothesized target sequence eI1.</Paragraph>
      <Paragraph position="7"> + In addition to these two knowledge sources, we need another system component which is referred to as a search or decision process. According to the Bayes decision rule, this search has to carry out the maximization of the product of the two probability distributions and thus ensures an optimal interaction of the two knowledge sources.</Paragraph>
      <Paragraph position="8">  Note that there is a guarantee of the minimization of decision errors if we know the true probability distributions Pr(eI1) and Pr(fJ1 jeI1) and if we carry out a full search over all target word sequences eI1. In addition, it should be noted that both the sequence of source words fJ1 and the sequence of unknown target words eI1 are modelled as a whole. The advantage then is that context dependencies can be fully taken into account and the syntactic analysis of both source and target sequences (at least in principle) can be integrated into the translation process.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Implementation of Stochastic Modelling
</SectionTitle>
      <Paragraph position="0"> To build a real operational system for language translation, we are faced with the following three problems: + Search problem: In principle, the innocent looking maximization requires the evaluation of 2000010 = 1043 possible target word sequences, when we assume a vocabulary of 20 000 target words and a sentence length of I = 10 words. This is the price we have to pay for a full interaction between the language model Pr(eI1) and and the string translation model Pr(fJ1 jeI1). In such a way, however, it is guaranteed that there is no better way to take the decisions about the words in the target language (for the given probability distributions Pr(eI1) and Pr(fJ1 jeI1)). In a practical system, we of course use suboptimal search strategies which require much less effort than a full search, but nevertheless should find the global optimum in virtually all cases.</Paragraph>
      <Paragraph position="1"> + Modelling problem: The two probability distributions Pr(eI1) and Pr(fJ1 jeI1) are too general to be used in a table look-up approach, because there is a huge number of possible values fJ1 and eI1. Therefore we have to introduce suitable structures into the distributions such that the number of free parameters is drastically reduced by taking suitable data dependencies into account.</Paragraph>
      <Paragraph position="2"> A key issue in modelling the string translation probability Pr(fJ1 jeI1) is the question of how we define the correspondence between the words of the target sentence and the words of the source sentence.</Paragraph>
      <Paragraph position="3"> In typical cases, we can assume a sort of pairwise dependence by considering all word pairs (fj;ei) for a given sentence pair (fJ1 ;eI1). Typically, the dependence is further constrained by assigning each source word to exactly one target word. Models describing these types of dependencies are referred to as alignment mappings (Brown et al. 1993): alignment mapping: j ! i = aj ; which assigns a source word fj in position j to a target word ei in position i = aj. As a result, the string translation probability can be decomposed into a lexicon probability and an alignment probability (Brown et al.</Paragraph>
      <Paragraph position="4"> 1993).</Paragraph>
      <Paragraph position="5"> + Training problem: After choosing suitable models for the two distributions Pr(eI1) and Pr(fJ1 jeI1), there remain free parameters that have to be learned from a set of training observations, which in the statistical terminology is referred to as parameter estimation.</Paragraph>
      <Paragraph position="6"> For several reasons, especially for the interdependence of the parameters, this learning task typically results in a complex mathematical optimization problem the details of which depend on the chosen model and on the chosen training criterion (such as maximum likelihood, squared error criterion, discriminative criterion, minimum number of recognition errors, ...).</Paragraph>
      <Paragraph position="7"> In conclusion, stochastic modelling as such does not solve the problems of automatic language translation, but defines a basis on which we can find the solutions to the problems. In contradiction to a widely held belief, a stochastic approach may very well require a specific model, and statistics helps us to make the best of a given model. Since undoubtedly we have to take decisions in the context of automatic language processing (and speech recognition), it can only be a rhetoric question of whether we should use statistical decision theory at all. To make a comparison with another field: in constructing a power plant, it would be foolish to ignore the principles of thermodynamics! As to the search problem, the most successful strategies are based on either stack decoding or A/ search and dynamic programming beam search. For comparison, in speech recognition, over the last few years, there has been a lot of progress in structuring the search process to generate a compact word lattice or word graph. To make this point crystal clear: The characteristic property of the stochastic modelling approach to language translation is not the use of hidden Markov models or hidden alignments. These methods are only the time-honoured methods and successful methods of today. The characteristic property lies in the systematic use of a probabilistic framework for the construction of models, in the statistical training of the free parameters of these models and in the explicit use of a global scoring criterion for the decision making process.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML