Confidence Estimation for Translation Prediction

2 Text Prediction for Translators

The application we are concerned with in this paper is an interactive text prediction tool for translators. The system observes a translator in the process of typing a target text and, after every character typed, has the opportunity to display a suggestion about what will come next, based on the source sentence under translation and the prefix of its translation that has already been typed. The translator may incorporate suggestions into the text if they are helpful, or simply ignore them and keep typing.

Suggestions may range in length from 0 characters to the end of the target sentence; it is up to the system to decide how much text to predict in a given context, balancing the greater potential benefit of longer predictions against a greater likelihood of being wrong, and a higher cost to the user (in terms of distraction and editing) if they are wrong or only partially right.

Our solution to the problem of how much text to predict is based on a decision-theoretic framework in which we attempt to find the prediction that maximizes the expected benefit to the translator in the current context (Foster et al., 2002b). Formally, we seek:

$$\hat{x} = \operatorname{argmax}_x B(x \mid h, s) \quad (1)$$

where $x$ is a prediction about what will follow $h$ in the translation of a source sentence $s$, and $B(x \mid h, s)$ is the expected benefit in terms of typing time saved.

As described in (Foster et al., 2002b), $B(x \mid h, s)$ is computed from two quantities: the probability $p(k \mid x, h, s)$ that exactly $k$ characters from the beginning of $x$ are correct, and the benefit $B(x \mid h, s, k)$ to the translator if this is the case:

$$B(x \mid h, s) = \sum_{k=0}^{|x|} p(k \mid x, h, s) \, B(x \mid h, s, k)$$

$B(x \mid h, s, k)$ is estimated from a model of user behaviour (based on data collected in user trials of the tool) that captures the cost of reading a prediction and performing any necessary editing, as well as the somewhat random nature of people's decisions to accept predictions. Prediction probabilities $p(k \mid x, h, s)$ are derived from a statistical translation model for $p(w \mid h, s)$, the probability that some word $w$ will follow the target text $h$ in the translation of a source sentence $s$.

Because optimizing (1) directly is expensive, we use a heuristic search procedure to approximate $\hat{x}$. For each length $m$ from 1 to a fixed maximum of $M$ (4 in this paper), we perform a Viterbi-like beam search with the translation model to find the sequence of words $\hat{w}_1, \ldots, \hat{w}_m$ most likely to follow $h$. For each such sequence, we form a corresponding character sequence $\hat{x}_m$ and evaluate its benefit $B(\hat{x}_m \mid h, s)$. The final output is the prediction $\hat{x}_m$ with maximum benefit, or nothing if all benefit estimates are negative.

To evaluate the system, we simulate a translator's actions on a given source text, using an existing translation as the text the translator wishes to type, and using the user model to determine his or her responses to predictions and to estimate the resulting benefit. Further details are given in (Foster et al., 2002b).
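To make the selection procedure concrete, the following minimal sketch shows how the benefit-maximizing prediction length could be chosen. It is an illustration under stated assumptions, not the authors' implementation: the helpers beam_search, correctness_probs, and benefit_given_k are hypothetical stand-ins for the Viterbi-like beam search, the probabilities $p(k \mid x, h, s)$, and the user-model benefit $B(x \mid h, s, k)$ described above.

    # Hypothetical sketch of the decision-theoretic length selection
    # (Section 2). Helper names are stand-ins for components described
    # in the text, not the authors' actual code.

    M_MAX = 4  # maximum prediction length in words, as used in the paper

    def best_prediction(h, s, beam_search, correctness_probs, benefit_given_k):
        """Return the prediction x̂_m with maximum expected benefit,
        or None if no candidate has positive expected benefit."""
        best_x, best_benefit = None, 0.0
        for m in range(1, M_MAX + 1):
            # Viterbi-like beam search: most likely m-word continuation of h.
            words = beam_search(h, s, length=m)
            x = " ".join(words)  # corresponding character sequence x̂_m
            # p(k | x, h, s) for k = 0 .. len(x), as a list of length len(x)+1.
            p_k = correctness_probs(x, h, s)
            # Expected benefit: B(x|h,s) = sum_k p(k|x,h,s) * B(x|h,s,k).
            b = sum(p_k[k] * benefit_given_k(x, h, s, k)
                    for k in range(len(x) + 1))
            if b > best_benefit:
                best_x, best_benefit = x, b
        return best_x  # None means: display nothing

Returning None when every estimate is non-positive mirrors the system's option of staying silent rather than showing a prediction whose expected cost outweighs its expected savings.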
2.1 Translation Models

We experimented with three different translation models for $p(w \mid h, s)$. All are fast enough to support real-time searches for predictions of up to 5 words.

The first model, referred to as Maxent1 below, is a log-linear combination of a trigram language model with a maximum entropy translation component that is an analog of IBM translation model 2 (Brown et al., 1993). This model is described in (Foster, 2000). Its major weakness is that it does not keep track of which words in the current source sentence have already been translated, and hence it is prone to repeating previous suggestions. The second model, called Maxent2 below, is similar to Maxent1 but adds extra parameters to limit this behaviour (Foster et al., 2002a).

The final model, called Bayes below, is also described in (Foster et al., 2002a). It is a noisy-channel combination of a trigram language model and an IBM model 2 for the source text given the target text. This model has roughly the same theoretical predictive capability as Maxent2, but unlike the Maxent models it is not discriminatively trained, and hence its native probability estimates tend to be much worse than theirs.

2.2 Computing Smoothed Conditional Probabilities

In order to calculate the character-based probabilities $p(k \mid x, h, s)$ required for estimating expected benefit, we need to know the conditional probabilities $p(w_i \mid w_1^{i-1}, h, s)$ of each word $w_i$ in a prediction $x = w_1 \ldots w_m$, in the context $(h, s)$. These are derived from correctness estimates obtained from our confidence-estimation layer as follows. As explained below, estimates from the CE layer are in the form $p(C{=}1 \mid \hat{w}_1^m, h, s)$, where $C$ is a binary variable indicating whether the prediction is correct and $\hat{w}_1^m$ is the most probable prediction of length $m$ according to the base translation model. Define a smoothed joint distribution over predictions of length $m$ as:

$$\tilde{p}(w_1^m \mid h, s) = \begin{cases} p(C{=}1 \mid \hat{w}_1^m, h, s), & w_1^m = \hat{w}_1^m \\ \alpha \, p(w_1^m \mid h, s), & \text{otherwise} \end{cases}$$

where $p(w_1^m \mid h, s) = \prod_{i=1}^m p(w_i \mid w_1^{i-1}, h, s)$ is calculated from the conditional probabilities given by the base model, and

$$\alpha = \frac{1 - p(C{=}1 \mid \hat{w}_1^m, h, s)}{1 - p(\hat{w}_1^m \mid h, s)}$$

is a normalization factor. Then the required smoothed conditional probabilities are estimated from the smoothed joint distributions in a straightforward way:

$$\tilde{p}(w_m \mid w_1^{m-1}, h, s) = \frac{\tilde{p}(w_1^m \mid h, s)}{\tilde{p}(w_1^{m-1} \mid h, s)}$$
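As a concrete illustration of this smoothing scheme, the sketch below computes $\tilde{p}(w_m \mid w_1^{m-1}, h, s)$ from a base model and CE-layer estimates. All names (smoothed_joint, base_prob, preds, correct) are hypothetical; this is a minimal sketch of the two equations above, assuming the CE estimates and base-model probabilities are supplied by the caller.

    # Hypothetical sketch of the smoothed conditional probabilities of
    # Section 2.2. Assumed inputs, for each length m:
    #   preds[m]   : ŵ_1^m, the base model's best length-m prediction (tuple)
    #   correct[m] : p(C=1 | ŵ_1^m, h, s) from the CE layer
    #   base_prob  : callable giving p(w_1^m | h, s) under the base model

    def smoothed_joint(w_seq, top_pred, p_correct, base_prob):
        """p~(w_1^m | h, s): the CE estimate for the top prediction,
        a renormalized base probability for every other sequence."""
        if tuple(w_seq) == tuple(top_pred):
            return p_correct
        # alpha spreads the remaining mass 1 - p(C=1|...) over all other
        # sequences in proportion to their base-model probabilities.
        alpha = (1.0 - p_correct) / (1.0 - base_prob(top_pred))
        return alpha * base_prob(w_seq)

    def smoothed_conditional(w_seq, preds, correct, base_prob):
        """p~(w_m | w_1^{m-1}, h, s) as the ratio of smoothed joints."""
        m = len(w_seq)
        num = smoothed_joint(w_seq, preds[m], correct[m], base_prob)
        if m == 1:
            den = 1.0  # the empty prefix has probability 1
        else:
            den = smoothed_joint(w_seq[:-1], preds[m - 1], correct[m - 1],
                                 base_prob)
        return num / den

Note how the scheme trusts the CE layer exactly where it has something to say (the single most probable prediction of each length) and falls back on renormalized base-model probabilities everywhere else, so the smoothed distribution still sums to one.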