<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0413">
  <Title>Confidence Estimation for Translation Prediction</Title>
  <Section position="4" start_page="1" end_page="1" type="metho">
    <SectionTitle>
3 Confidence Estimation with Neural Nets
</SectionTitle>
    <Paragraph position="0"> Our approach for CE consists in training neural nets to estimate the conditional probability of correctness p(C = 1 | ŵ_m, h, s), where ŵ_m is the most probable prediction of length m from an n-best set of alternative predictions according to the base model. In our experiments the prediction length m varies between 1 and 4 and n is at most 5. As the n-best predictions {ŵ_m^1, ..., ŵ_m^n} are themselves a function of the context, we will simply denote the conditional probability of correctness by p(C = 1 | ŵ_m, h, s).</Paragraph>
    <Paragraph position="5"> We experimented with two types of neural nets: single-layer perceptrons (SLPs) and multi-layer perceptrons (MLPs) with 20 hidden units. For both, we used a softmax activation function and gradient descent training with a negative log-likelihood error function. Given suitably-behaved class-conditional feature distributions, this setup is guaranteed to yield estimates of the true posterior probabilities p(C = 1 | ŵ_m, h, s).</Paragraph>
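    <Paragraph> To make the training setup concrete, the following is a minimal sketch of such a classifier: softmax outputs over the two classes (correct/incorrect), one hidden layer of 20 units, and gradient descent on the negative log-likelihood. This is not the implementation used in the paper (which relied on the Torch toolkit); the feature matrix and labels are hypothetical placeholders.
# Minimal sketch of the CE classifier described above: softmax outputs,
# one hidden layer of 20 tanh units, trained by gradient descent on the
# negative log-likelihood. Data loading and feature extraction are
# hypothetical placeholders.
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

class MLP:
    def __init__(self, n_features, n_hidden=20, n_classes=2, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, (n_features, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0, 0.1, (n_hidden, n_classes))
        self.b2 = np.zeros(n_classes)
        self.lr = lr

    def forward(self, X):
        self.h = np.tanh(X @ self.W1 + self.b1)      # hidden activations
        return softmax(self.h @ self.W2 + self.b2)   # estimate of p(C | features)

    def train_step(self, X, y):
        """One batch of gradient descent on the mean NLL of the correct class."""
        p = self.forward(X)                           # shape (n, 2)
        n = X.shape[0]
        onehot = np.eye(p.shape[1])[y]
        d_out = (p - onehot) / n                      # gradient w.r.t. pre-softmax outputs
        dW2 = self.h.T @ d_out
        db2 = d_out.sum(axis=0)
        d_h = (d_out @ self.W2.T) * (1.0 - self.h ** 2)   # tanh backprop
        dW1 = X.T @ d_h
        db1 = d_h.sum(axis=0)
        for param, grad in ((self.W2, dW2), (self.b2, db2),
                            (self.W1, dW1), (self.b1, db1)):
            param -= self.lr * grad
        return -np.mean(np.log(p[np.arange(n), y] + 1e-12))  # NLL

# Hypothetical usage: X holds the confidence features of Section 3.2,
# y marks each prediction as correct (1) or incorrect (0).
# model = MLP(n_features=18)
# for epoch in range(50):
#     nll = model.train_step(X_train, y_train)
    </Paragraph>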
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.1 Single Layer Neural Nets and Maximum
Entropy Models
</SectionTitle>
      <Paragraph position="0"> It is interesting to note the relation between the SLP and maximum entropy models. For the problem of estimating p(y | x) for a set of classes y over a space of input vectors x, a single-layer neural net with "softmax" outputs takes the form: p(y | x) = exp(w_y · x + b) / Z(x), where w_y is a vector of weights for class y, b is a bias term, and Z(x) is a normalization factor, the sum over all classes of the numerator. A maximum entropy model is a generalization of this in which an arbitrary feature function f_y(x) is used to transform the input space as a function of y: p(y | x) = exp(w · f_y(x)) / Z(x). Both models are trained by maximum likelihood methods. Given C classes, the maximum entropy model can simulate a SLP by dividing its weight vector into C blocks, each the size of x, then using f_y(x) = (0, ..., 0, x, 0, ..., 0, 1), where everything outside the y-th block is a vector of 0's and the final 1 yields a bias term.</Paragraph>
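      <Paragraph> As an illustration, the following sketch builds one possible version of the block feature vector f_y(x) and checks that the maxent score w · f_y(x) reproduces the SLP score w_y · x + b; the exact vector layout (per-class blocks followed by a single shared bias slot) is an assumption, and all names are illustrative.
# Sketch: a maxent model over block features f_y(x) reproduces an SLP.
# The layout (per-class block of x, plus one final bias slot) is one way to
# realize the construction in the text.
import numpy as np

def f(y, x, n_classes):
    """Block feature vector: x copied into the y-th block, plus a final 1."""
    d = len(x)
    v = np.zeros(n_classes * d + 1)
    v[y * d:(y + 1) * d] = x
    v[-1] = 1.0                        # the final 1 yields a bias term
    return v

def slp_scores(x, W, b):
    return W @ x + b                   # one score per class (shared bias b)

def maxent_scores(x, w, n_classes):
    return np.array([w @ f(y, x, n_classes) for y in range(n_classes)])

rng = np.random.default_rng(0)
n_classes, d = 2, 5
W, b, x = rng.normal(size=(n_classes, d)), rng.normal(), rng.normal(size=d)
# Pack the SLP weights and the bias into one maxent weight vector.
w = np.concatenate([W.ravel(), [b]])
assert np.allclose(slp_scores(x, W, b), maxent_scores(x, w, n_classes))
      </Paragraph>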
      <Paragraph position="7"> The advantage of maximum-entropy models is that their features can depend on the target class. For natural-language applications where target classes correspond to words, this produces an economical and powerful representation. However, for CE, where the output is binary (correct or incorrect), this capacity is less interesting. In fact, there is no a priori reason to use a different set of features for correct outputs or incorrect ones, so the natural form of a maxent model for this problem is identical to a SLP (modulo a bias term). Therefore the experiments we describe below can be seen as a comparison between maxent models and neural nets with a hidden layer.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.2 Confidence Features
</SectionTitle>
      <Paragraph position="0"> The features we use can be divided into three families: ones designed to capture the intrinsic difficulty of the source sentence s (for any NLP task); ones intended to reflect how hard s is to translate in general; and ones intended to reflect how hard s is for the current model to translate. For the first two families, we used two sets of values: static ones that depend on s, and dynamic ones that depend only on those words in s that are deemed to be still untranslated, as determined by an IBM2 word alignment between s and h. The features are:
+ family 1: trigram perplexity, minimum trigram word probability, average word frequency, average word length, and number of words;
+ family 2: average number of translations per source word (according to an independent IBM1), average IBM1 source word entropy, number of source tokens still to be translated, number of unknown source tokens, ratio of linked to unlinked source words within the aligned region of the source sentence, and length of the current target-text prefix; and
+ family 3: average number of search hypotheses pruned (i.e., outside the beam) per time step, final search lattice size, active vocabulary size (number of target words considered in the search), number of n-best hypotheses, rank of current hypothesis, probability ratio of best hypothesis to sum of top 5 hypotheses, and base model probability of current prediction.</Paragraph>
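      <Paragraph> To ground a few of these, here is a hedged sketch of some family-1 (intrinsic difficulty) features; the trigram model interface and corpus frequency table are hypothetical stand-ins, not the resources used in the paper.
# Sketch of a few family-1 "intrinsic difficulty" features for a source
# sentence. trigram_prob(w, u, v) -> p(w | u, v) and word_freq are
# hypothetical stand-ins for whatever language model and counts are available.
import math

def family1_features(tokens, trigram_prob, word_freq):
    logps = []
    for i, w in enumerate(tokens):
        u = tokens[i - 2] if i >= 2 else "<s>"
        v = tokens[i - 1] if i >= 1 else "<s>"
        logps.append(math.log(max(trigram_prob(w, u, v), 1e-12)))
    return {
        "trigram_perplexity": math.exp(-sum(logps) / len(tokens)),
        "min_trigram_word_prob": math.exp(min(logps)),
        "avg_word_frequency": sum(word_freq.get(w, 0) for w in tokens) / len(tokens),
        "avg_word_length": sum(len(w) for w in tokens) / len(tokens),
        "num_words": len(tokens),
    }
      </Paragraph>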
    </Section>
  </Section>
  <Section position="5" start_page="1" end_page="1" type="metho">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> Evaluation is performed using test sets of translation predictions, each tagged as correct or incorrect. A translation prediction ŵ_m is tagged as correct if and only if an identical word sequence is found in the reference translation, properly aligned. This reflects our application, where we attempt to match what a particular translator has in mind, not simply produce any correct translation. We use two types of evaluation methods: ROC curves and a user simulation as described above.</Paragraph>
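    <Paragraph> One simple way to realize this tagging, under the assumption that "properly aligned" means the predicted words must match the reference at the position aligned with the end of the accepted target prefix, is sketched below; the alignment interface is a hypothetical stand-in.
# Hedged sketch of the correctness tagging: an m-word prediction counts as
# correct only if the same m-word sequence occurs in the reference at the
# aligned position. aligned_position is assumed to be supplied by the word
# alignment used in the paper.
def is_correct(prediction, reference, aligned_position):
    """prediction: list of m predicted words; reference: reference tokens."""
    m = len(prediction)
    return reference[aligned_position:aligned_position + m] == prediction
    </Paragraph>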
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.1 ROC curves
</SectionTitle>
      <Paragraph position="0"> Consider a set of tokens t_1, ..., t_N in a test set D, each tagged as correct or false, and a scoring function s over these tokens. If the score is not a probability, it can range over any real interval [a, b]. Given a rejection threshold θ ∈ [a, b], any token t_i is rejected if s(t_i) &lt; θ and it is accepted otherwise.</Paragraph>
      <Paragraph position="6"> The correct acceptance rate CA(θ) of a threshold θ over D is the rate of correct tokens t_i with s(t_i) ≥ θ among all correct tokens in D. Similarly, the correct rejection rate CR(θ) is the rate of false tokens t_i with s(t_i) &lt; θ among all false tokens in D. The plot of the points (CA(θ), CR(θ)) obtained by varying θ over [a, b] is called the ROC curve of s over D. The discrimination capacity of s is given by its capacity to distinguish correct from false tokens. Consequently, a perfect ROC curve would describe the square (0, 1), (1, 1), (1, 0).</Paragraph>
      <Paragraph position="11"> This is the case whenever there exists a threshold θ ∈ [a, b] that separates all correct tokens in D from all the false ones, meaning that the score ranges of correct and false tokens do not overlap. The worst-case scenario, describing a scoring function that is completely irrelevant for correct/false discrimination, corresponds to the diagonal from (0, 1) to (1, 0). Note that the inverse of the ideal ROC curve, the plot overlapping the axes through (0, 1), (0, 0), (1, 0), is equivalent to the ideal curve from a discrimination capacity point of view: it suffices to invert the rejection algorithm by accepting all tokens whose score falls below the rejection threshold.</Paragraph>
      <Paragraph position="12"> In our setting, the tokens are the ŵ_m translation predictions and the score function is the conditional probability of correctness p(C = 1 | ŵ_m, h, s).</Paragraph>
      <Paragraph position="14"> In order to easily compare the discrimination capacity of various scoring functions we use a raw measure, the integral of the ROC curve, or IROC. A perfect ROC curve will have an IROC of 1.0 (respectively 0.0 in the inverse case). The worst-case scenario corresponds to an IROC of 0.5. We also compare various scoring functions by fixing an operational point at CA = 0.80 and observing the corresponding CR values.</Paragraph>
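      <Paragraph> A minimal sketch of these definitions, assuming scores and binary correctness labels are given as arrays; the trapezoidal integration and the choice of the operating point closest to CA = 0.80 are reasonable assumptions the text does not spell out.
# Sketch of the CA/CR/ROC/IROC definitions above: for each threshold, CA is
# the fraction of correct tokens accepted (score >= threshold) and CR is the
# fraction of false tokens rejected. IROC integrates CR over CA.
import numpy as np

def roc_points(scores, correct):
    scores, correct = np.asarray(scores, float), np.asarray(correct, bool)
    thresholds = np.concatenate(([-np.inf], np.unique(scores), [np.inf]))
    ca = np.array([(scores[correct] >= t).mean() for t in thresholds])
    cr = np.array([1.0 - (scores[~correct] >= t).mean() for t in thresholds])
    return ca, cr

def iroc(scores, correct):
    ca, cr = roc_points(scores, correct)
    order = np.argsort(ca)                   # integrate CR along increasing CA
    return np.trapz(cr[order], ca[order])

def cr_at_ca(scores, correct, target_ca=0.80):
    """CR at the operating point whose CA is closest to target_ca."""
    ca, cr = roc_points(scores, correct)
    return cr[np.argmin(np.abs(ca - target_ca))]
      </Paragraph>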
    </Section>
  </Section>
  <Section position="6" start_page="1" end_page="2" type="metho">
    <SectionTitle>
5 Experimental Set-up
</SectionTitle>
    <Paragraph position="0"> The data for our experiments originates from the Hansard English-French parallel corpus. In order to generate the train and test sets, we use 1.3 million translation predictions (900,000 for training and 400,000 for testing) for each fixed prediction length of one, two, three and four words, summing to a total of 5.2 million prediction examples. Each original SMT model was combined with two different CE model architectures: MLPs with one hidden layer containing 20 hidden units, and SLPs (sometimes also referred to as MLPs with 0 hidden units). Moreover, for each (native model, CE model architecture) pair, we train five separate CE models: one for each fixed prediction length of one, two, three or four words, and an additional model for variable prediction lengths of up to four words.</Paragraph>
    [Figure caption: the Bayes prediction model probability and the CE of the corresponding SLP and MLP on predictions of up to four words.]
  </Section>
  <Section position="7" start_page="2" end_page="2" type="metho">
    <SectionTitle>
6 ROC Evaluations
</SectionTitle>
    <Paragraph position="0"> In this section we report the ROC evaluation results. The user-model evaluation results are presented in the following section.</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
6.1 CE and Native SMT Probabilities
</SectionTitle>
      <Paragraph position="0"> The first question we wish to address is whether we can improve the correct/false discrimination capacity by using the probability of correctness estimated by the CE model instead of the native probabilities.</Paragraph>
      <Paragraph position="1"> For each SMT model we compare the ROC plots, IROC and CA/CR values obtained by the native probability and the estimated probability of correctness output by the corresponding SLPs (also noted as mlp-0-hu) and the 20-hidden-unit MLPs on the one-to-four word prediction task.</Paragraph>
      <Paragraph position="2"> Results obtained for various length predictions of up to four words using the Bayes models are summarized in figure 1 and in table 1 below, and are encouraging. At a fixed CA of 0.80 we obtain CR increases from 0.6604 for the native probability to 0.7211 for the SLP and 0.7728 for the MLP. The overall gain is also evident from the relative improvements in IROC obtained by the SLP and MLP models over the native probability, which are 17.06% and 33.31%, respectively. These results are quite significant.</Paragraph>
      <Paragraph position="3"> Note that the improvements obtained in the fixed-length 4-word-prediction task with the Bayes model (figure 2 and table 2) are even larger: the relative improvements in IROC are 32.36% and 50.07% for the SLP and the MLP, respectively.</Paragraph>
      <Paragraph position="4"> However, the results obtained with the Maxent models are much less positive: the SLP CR actually drops, while the MLP CR only increases slightly, to a 4.80% relative improvement in the CR rate for the Maxent1 model (table 3) and only 3.9% for the Maxent2 model (table 4). Training and testing of the neural nets was done using the open-source Torch toolkit ((Collobert et al., 2002), http://www.torch.ch/), which provides efficient C++ implementations of many ML algorithms.</Paragraph>
      [Figure captions: the Maxent1 prediction model probability and the CE of the corresponding SLP and MLP on predictions of up to four words; the Maxent2 prediction model probability and the CE of the corresponding SLP and MLP on predictions of up to four words.]
      <Paragraph position="5"> The results obtained with the two Maxent models are very similar. We therefore only draw the ROC curve for the Maxent2 model (figure 3).</Paragraph>
      <Paragraph position="6"> It is interesting to note that the native model prediction accuracy did not affect the discrimination capacity of the corresponding probability of correctness of the CE models. This result is illustrated in the table below, where %C = 1 is the percentage of correct predictions. Even though the Bayes model's accuracy and IROC are significantly lower than the Maxent model's, the CE IROC values are almost identical.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
6.2 Relevance of Confidence Features
</SectionTitle>
      <Paragraph position="0"> We investigated the relevance of different confidence features by using the IROC values of single-feature models for the 1-4 word prediction task, with both Maxent1 and Bayes base models.</Paragraph>
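      <Paragraph> A hedged sketch of this relevance probe: each confidence feature is used alone as a score and features are ranked by the resulting IROC. The iroc helper mirrors the one sketched in Section 4.1, and the orientation-free max(IROC, 1 - IROC) step as well as the data layout are assumptions.
# Sketch of the single-feature relevance probe: rank features by the IROC of
# the feature used alone as a score. feature_arrays maps feature names to
# one value per prediction; names and layout are placeholders.
import numpy as np

def iroc(scores, correct):
    scores, correct = np.asarray(scores, float), np.asarray(correct, bool)
    thresholds = np.concatenate(([-np.inf], np.unique(scores), [np.inf]))
    ca = np.array([(scores[correct] >= t).mean() for t in thresholds])
    cr = np.array([1.0 - (scores[~correct] >= t).mean() for t in thresholds])
    order = np.argsort(ca)
    return np.trapz(cr[order], ca[order])

def rank_features(feature_arrays, correct):
    relevance = {name: max(iroc(vals, correct), 1.0 - iroc(vals, correct))
                 for name, vals in feature_arrays.items()}  # orientation-free
    return sorted(relevance.items(), key=lambda kv: kv[1], reverse=True)
      </Paragraph>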
      <Paragraph position="1"> The group of features that performs best over both models is the model- and search-dependent family described above, followed by the features that capture the intrinsic difficulty of the source sentence and the target prefix. Least valuable are the remaining features that capture translation difficulty. The single most significant feature is the native probability, followed by the probability ratio of the best hypothesis and the prediction length.</Paragraph>
      <Paragraph position="2"> Somewhat unsurprisingly, the weaker Bayes models are much more sensitive to longer translations than the Maxent models.</Paragraph>
    </Section>
    <Section position="3" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
6.3 Dealing with predictions of various lengths
</SectionTitle>
      <Paragraph position="0"> We compared different approaches for dealing with predictions of various lengths: we trained four separate MLPs for fixed-length predictions of one through four words, and a single MLP over predictions of varying lengths. Results are given in table 5 and figure 4.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="2" end_page="2" type="metho">
    <SectionTitle>
7 Model Combination
</SectionTitle>
    <Paragraph position="0"> In this section we describe how various model combination schemes affect prediction accuracy. We use the Bayes and the Maxent2 prediction models: we try to exploit the fact that these two models, being fundamentally different, tend to be complementary in some of their responses. The CE models we use are the corresponding MLPs, as they clearly outperform the SLPs. The results presented in table 6 are reported on the variable-length prediction task for up to four words.</Paragraph>
    <Paragraph position="1"> The combination schemes are the following: we run the two prediction models in parallel and choose one of the proposed prediction hypotheses according to the following voting criteria (see the sketch below):
+ Maximum CE vote: choose the prediction with the highest estimated probability of correctness.
+ Maximum native probability vote: choose the prediction with the highest native probability.</Paragraph>
    [Table caption fragment: individual model accuracy compared with combined model accuracy.]
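    <Paragraph> A minimal sketch of the two voting schemes; the hypothesis layout (prediction, native probability, CE probability) and the example values are illustrative assumptions.
# Sketch of the two voting schemes: each prediction model proposes a
# hypothesis with a native probability and a CE-estimated probability of
# correctness, and the combined system keeps one of them.
def combine_max_ce(hyp_a, hyp_b):
    """Maximum CE vote: keep the hypothesis with the higher p(C=1 | ...)."""
    return hyp_a if hyp_a[2] >= hyp_b[2] else hyp_b

def combine_max_native(hyp_a, hyp_b):
    """Maximum native probability vote: keep the higher base-model probability."""
    return hyp_a if hyp_a[1] >= hyp_b[1] else hyp_b

# Example with hypothetical values: (prediction, native prob, CE prob)
bayes_hyp = ("de la chambre", 0.12, 0.55)
maxent_hyp = ("de la salle", 0.20, 0.40)
assert combine_max_ce(bayes_hyp, maxent_hyp) is bayes_hyp
assert combine_max_native(bayes_hyp, maxent_hyp) is maxent_hyp
    </Paragraph>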
    <Paragraph position="4"> As a baseline comparison, we use the accuracy of the individual native prediction models. Then we compute the maximum gain we can expect with an optimal model combination strategy, obtained by running an "oracle" that always picks the right answer.</Paragraph>
    <Paragraph position="5"> The results are very positive: the maximum CE voting scheme obtains 29.31% of the maximum possible accuracy gain over the better of the two individual models (Maxent2). Moreover, if we choose the maximum native probability vote, the overall accuracy actually drops significantly. These results are a strong motivation for our post-prediction confidence estimation approach: by training an additional CE layer using the same confidence features and training data for different underlying prediction models, we obtain more uniform estimates of the probability of correctness.</Paragraph>
  </Section>
  <Section position="9" start_page="2" end_page="2" type="metho">
    <SectionTitle>
8 User-Model Evaluations
</SectionTitle>
    <Paragraph position="0"> As described in section 2, we evaluated the prediction system as a whole by simulating the actions of a translator on a given source text and measuring the gain with a user model.</Paragraph>
    [Table header fragment: model, base, mults, SLP, MLP, best ... configurations.]
    <Paragraph position="1"> In order to abstract away from approximations made in deriving the character-based probabilities p(k | x, h, s) used in the benefit calculation from word-based probabilities, we employed a specialized user model. In contrast to the realistic model described in (Foster et al., 2002b), this one assumes that users accept predictions only at the beginnings of words, and only when they are correct in their entirety. To reduce variation further, it also assumes that the user always accepts a correct prediction as soon as it is suggested. Thus the model's estimates of benefit to the user may be slightly overoptimistic: the limited opportunities for accepting and editing must be balanced against the user's inhumanly perfect decision-making. However, its main purpose is not realism but simply to allow for a fair comparison between the base and the CE models.</Paragraph>
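    <Paragraph> A hedged sketch of this simulated user, under the simplifying assumption that benefit is counted as the fraction of characters the user does not have to type; the actual benefit calculation of section 2 is richer, and propose() is a hypothetical stand-in for the prediction engine.
# Hedged sketch of the specialized user model: at each word boundary the
# simulated user accepts the proposed prediction only if it is correct in
# its entirety, and effort saved is counted in characters (a proxy for
# typing time). propose(prefix) returns a list of predicted words or None.
def simulate(reference_words, propose):
    total = sum(len(w) + 1 for w in reference_words)   # +1 per word for spaces
    saved, i = 0, 0
    while i != len(reference_words):
        prediction = propose(reference_words[:i])      # context = accepted prefix
        if prediction and reference_words[i:i + len(prediction)] == prediction:
            saved += sum(len(w) + 1 for w in prediction)   # accepted: no typing
            i += len(prediction)
        else:
            i += 1                                      # user types the next word
    return 100.0 * saved / total                        # percent characters saved

# Hypothetical usage with a predictor that never proposes anything (0% saved):
# simulate("nous avons besoin de plus de temps".split(), lambda prefix: None)
    </Paragraph>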
    <Paragraph position="2"> Simulations with all three translation models were performed using a 500-sentence test text. At each prediction point, the benefits associated with best predictions of 1-4 words in length were compared to decide which (if any) to propose. The results, in terms of percentages of typing time saved, are shown in table 8: base corresponds to the base model; mults to length-specific probability multipliers tuned to optimize benefit on a held-out corpus; SLP and MLP to CE estimates; and best to using an oracle to pick the length that maximizes benefit.</Paragraph>
    <Paragraph position="3"> Although the CE layer provides no gain over the much simpler probability-multiplier approach for the Bayes model, the gain for both maxent models is substantial, around 10% in relative terms and 25% of the theoretical maximum gain (over the base model) with the MLP and slightly lower with the SLP.</Paragraph>
  </Section>
</Paper>