<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1020">
  <Title>User-Friendly Text Prediction for Translators</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The Text Prediction Task
</SectionTitle>
    <Paragraph position="0"> In the basic prediction task, the input to the predictor is a source sentence s and a prefix h of its translation (ie, the target text before the current cursor position); the output is a proposed extension x to h. Figure 1 gives an example. Unlike the TransType prototype, which proposes a set of single-word (or single-unit) suggestions, we assume that each prediction consists of only a single proposal, but one that may span an arbitrary number of words.</Paragraph>
    <Paragraph position="1"> As described above, the goal of the predictor is to find the prediction ^x that maximizes the expected s: Let us return to serious matters.</Paragraph>
    <Paragraph position="3"> French translation. s is the source sentence, h is the part of its translation that has already been typed, x is what the translator wants to type, and x is the prediction.</Paragraph>
    <Paragraph position="4"> benefit to the user:</Paragraph>
    <Paragraph position="6"> where B(x;h;s) measures typing time saved. This obviously depends on how much of x is correct, and how long it would take to edit it into the desired text.</Paragraph>
    <Paragraph position="7"> A major simplifying assumption we make is that the user edits only by erasing wrong characters from the end of a proposal. Given a TransType-style interface where acceptance places the cursor at the end of a proposal, this is the most common editing method, and it gives a conservative estimate of the cost attainable by other methods. With this assumption, the key determinant of edit cost is the length of the correct prefix of x, so the expected benefit can be written as:</Paragraph>
    <Paragraph position="9"> where p(kjx;h;s) is the probability that exactly k characters from the beginning of x will be correct, l is the length of x, and B(x;h;s;k) is the benefit to the user given that the first k characters of x are correct.</Paragraph>
    <Paragraph position="10"> Equations (1) and (2) define three main problems: estimating the prefix probabilities p(kjx;h;s), estimating the user benefit function B(x;h;s;k), and searching for ^x. The following three sections describe our solutions to these.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Translation Model
</SectionTitle>
    <Paragraph position="0"> The correct-prefix probabilities p(kjx;h;s) are derived from a word-based statistical translation model. The first step in the derivation is to convert these into a form that deals explicitly with character strings. This is accomplished by noting that p(kjx;h;s) is the probability that the first k characters of x are correct and that the k + 1th character (if there is one) is incorrect. For k&lt;l:</Paragraph>
    <Paragraph position="2"> where xk1 = x1:::xk. If k = l, p(kjx;h;s) = p(xjh;s). Also, p(x01) 1.</Paragraph>
    <Paragraph position="3"> The next step is to convert string probabilities into word probabilities. To do this, we assume that strings map one-to-one into token sequences, so that: p(xk1jh;s) p(v1;w2;:::;wm 1;umjh;s); where v1 is a possibly-empty word suffix, each wi is a complete word, and um is a possibly empty word prefix. For example, if x in figure 1 were evenir aux choses, then x141 would map to v1 = evenir, w2 = aux, and u3 = cho. The one-to-one assumption is reasonable given that entries in our lexicon contain neither whitespace nor internal punctuation.</Paragraph>
    <Paragraph position="4"> To model word-sequence probabilities, we apply the chain rule:</Paragraph>
    <Paragraph position="6"> The probabilities of v1 and um can be expressed in terms of word probabilities as follows. Letting u1 be the prefix of the word that ends in v1 (eg, r in</Paragraph>
    <Paragraph position="8"> where the sum is over all words that start with u1.</Paragraph>
    <Paragraph position="9"> Similarly:</Paragraph>
    <Paragraph position="11"> Thus all factors in (3) can be calculated from probabilities of the form p(wjh;s) which give the likelihood that a word w will follow a previous sequence of words h in the translation of s.1 This is the family of distributions we have concentrated on modeling.</Paragraph>
    <Paragraph position="12"> Our model for p(wjh;s) is a log-linear combination of a trigram language model for p(wjh) and a maximum-entropy translation model for p(wjs), described in (Foster, 2000a; Foster, 2000b). The translation component is an analog of the IBM model 2 (Brown et al., 1993), with parameters that are optimized for use with the trigram. The combined model is shown in (Foster, 2000a) to have significantly lower test corpus perplexity than the linear combination of a trigram and IBM 2 used in the TransType experiments (Langlais et al., 2002). Both models supportO(mJV 3) Viterbi-style searches for the most likely sequence of m words that follows h, where J is the number of tokens in s and V is the size of the target-language vocabulary.</Paragraph>
    <Paragraph position="13"> Compared to an equivalent noisy-channel combination of the form p(t)p(sjt), where t is the target sentence, our model is faster but less accurate. It is faster because the search problem for noisy-channel models is NP-complete (Knight, 1999), and even the fastest dynamic-programming heuristics used in statistical MT (Niessen et al., 1998; Tillmann and Ney, 2000), are polynomial in J--for instance O(mJ4V 3) in (Tillmann and Ney, 2000). It is less accurate because it ignores the alignment relation between s and h, which is captured by even the simplest noisy-channel models. Our model is therefore suitable for making predictions in real time, but not for establishing complete translations unassisted by a human.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Implementation
</SectionTitle>
      <Paragraph position="0"> The most expensive part of the calculation in equation (3) is the sum in (4) over all words in the vocabulary, which according to (2) must be carried out for every character position k in a given prediction x. We reduce the cost of this by performing sums only at the end of each sequence of complete tokens in x (eg, after revenir and revenir aux in the above example). At these points, probabilities for all possible prefixes of the next word are calculated in a 1Here we ignore the distinction between previous words that have been sanctioned by the translator and those that are hypothesized as part of the current prediction.</Paragraph>
      <Paragraph position="1"> single recursive pass over the vocabulary and stored in a trie for later access.</Paragraph>
      <Paragraph position="2"> In addition to the exact calculation, we also experimented with establishing exact probabilities via p(wjh;s) only at the end of each token in x, and assuming that the probabilities of the intervening characters vary linearly between these points. As a result of this assumption, p(kjx;h;s) = p(xk1jh;s) p(xk+11 jh;s) is constant for all k between the end of one word and the next, and therefore can be factored out of the sum in equation (2) between these points.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 User Model
</SectionTitle>
    <Paragraph position="0"> The purpose of the user model is to determine the expected benefit B(x;h;s;k) to the translator of a prediction x whose first k characters match the text that the translator wishes to type. This will depend on whether the translator decides to accept or reject the prediction, so the first step in our model is the following expansion:</Paragraph>
    <Paragraph position="2"> wherep(ajx;h;s;k) is the probability that the translator accepts or rejects x, B(x;h;s;k;a) is the benefit they derive from doing so, and a is a random variable that takes on the values 1 for acceptance and 0 for rejection. The first two quantities are the main elements in the user model, and are described in following sections. The parameters of both were estimated from data collected during the TransType trial described in (Langlais et al., 2002), which involved nine accomplished translators using a prototype prediction tool for approximately half an hour each. In all cases, estimates were made by pooling the data for all nine translators.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Acceptance Probability
</SectionTitle>
      <Paragraph position="0"> Ideally, a model for p(ajx;h;s;k) would take into account whether the user actually reads the proposal before accepting or rejecting it, eg:</Paragraph>
      <Paragraph position="2"> where r is a boolean &amp;quot;read&amp;quot; variable. However, this information is hard to extract reliably from the available data; and even if were obtainable, many of the  cepted versus its gain.</Paragraph>
      <Paragraph position="3"> factors which influence whether a user is likely to read a proposal--such as a record of how many previous predictions have been accepted--are not available to the predictor in our formulation. We thus model p(ajx;h;s;k) directly.</Paragraph>
      <Paragraph position="4"> Our model is based on the assumption that the probability of accepting x depends only on what the user stands to gain from it, defined according to the editing scenario given in section 2 as the amount by which the length of the correct prefix of x exceeds the length of the incorrect suffix: p(ajx;h;s;k) p(aj2k l); wherek (l k) = 2k l is called the gain. For instance, the gain for the prediction in figure 1 would be 2 7 8 = 6. The strongest part of this assumption is dropping the dependence on h, because there is some evidence from the data that users are more likely to accept at the beginnings of words. However, this does not appear to have a severe effect on the quality of the model.</Paragraph>
      <Paragraph position="5"> Figure 2 shows empirical estimates of p(a = 1j2k l) from the TransType data. There is a certain amount of noise intrinsic to the estimation procedure, since it is difficult to determine x , and therefore k, reliably from the data in some cases (when the user is editing the text heavily). Nonetheless, it is apparent from the plot that gain is a useful abstrac- null tion, because the empirical probability of acceptance is very low when it is less than zero and rises rapidly as it increases. This relatively clean separation supports the basic assumption in section 2 that benefit depends on k.</Paragraph>
      <Paragraph position="6"> The points labelled smoothed in figure 2 were obtained using a sliding-average smoother, and the model curve was obtained using two-component Gaussian mixtures to fit the smoothed empirical likelihoods p(gainja = 0) and p(gainja = 1). The model probabilities are taken from the curve at integral values. As an example, the probability of accepting the prediction in figure 1 is about .25.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Benefit
</SectionTitle>
      <Paragraph position="0"> The benefit B(x;h;s;k;a) is defined as the typing time the translator saves by accepting or rejecting a prediction x whose first k characters are correct.</Paragraph>
      <Paragraph position="1"> To determine this, we assume that the translator first reads x, then, if he or she decides to accept, uses a special command to place the cursor at the end of x and erases its last l k characters. Assuming independence from h;s as before, our model is:</Paragraph>
      <Paragraph position="3"> where Ra(x) is the cost of reading x when it ultimately gets accepted (a= 1) or rejected (a= 0), T(x;k) is the cost of manually typing xk1, and E(x;k) is the edit cost of accepting x and erasing to the end of its first k characters.</Paragraph>
      <Paragraph position="4"> A natural unit for B(x;k;a) is the number of keystrokes saved, so all elements of the above equation are converted to this measure. This is straight-forward in the case of T(x;k) and E(x;k), which are estimated as k and l k + 1 respectively--for E(x;k), this corresponds to one keystroke for the command to accept a prediction, and one to erase each wrong character. This is likely to slightly underestimate the true benefit, because it is usually harder to type n characters than to erase them.</Paragraph>
      <Paragraph position="5"> As in the previous section, read costs are interpreted as expected values with respect to the probability that the user actually does read x, eg, assuming 0 cost for not reading, R0(x) = p(r=1jx)R00(x), where R00(x) is the unknown true cost of reading and rejecting x. To determine Ra(x), we measured the average elapsed time in the TransType data from the point at which a proposal was displayed to the point at which the next user action occurred--either an acceptance or some other command signalling a rejection. Times greater than 5 seconds were treated as indicating that the translator was distracted and were filtered out. As shown in figure 3, read times are much higher for predictions that get accepted, reflecting both a more careful perusal by the translator and the fact the rejected predictions are often simply  the whole contents of the TransType menu in the case of rejections, and only the proposal that was ultimately accepted in the case of acceptances.</Paragraph>
      <Paragraph position="6"> tionship between the number of characters read and the time taken to read them, so we used the least-squares lines shown as our models. Both plots are noisy and would benefit from a more sophisticated psycholinguistic analysis, but they are plausible and empirically-grounded first approximations.</Paragraph>
      <Paragraph position="7"> To convert reading times to keystrokes for the benefit function we calculated an average time per keystroke (304 milliseconds) based on sections of the trial where translators were rapidly typing and when predictions were not displayed. This gives an upper bound for the per-keystroke cost of reading-compare to, for instance, simply dividing the total time required to produce a text by the number of characters in it--and therefore results in a conservative estimate of benefit.</Paragraph>
      <Paragraph position="8"> To illustrate the complete user model, in the figure 1 example the benefit of accepting would be 7 2 4:2 = :8 keystrokes and the benefit of rejecting would be :2 keystrokes. Combining these with the acceptance probability of .25 gives an overall expected benefit B(x;h;s;k = 7) for this proposal of 0.05 keystrokes.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Search
</SectionTitle>
    <Paragraph position="0"> Searching directly through all character strings x in order to find ^x according to equation (1) would be very expensive. The fact that B(x;h;s) is non-monotonic in the length of x makes it difficult to organize efficient dynamic-programming search techniques or use heuristics to prune partial hypotheses.</Paragraph>
    <Paragraph position="1"> Because of this, we adopted a fairly radical search strategy that involves first finding the most likely sequence of words of each length, then calculating the benefit of each of these sequences to determine the best proposal. The algorithm is:  1. For each length m = 1:::M, find the best word sequence:</Paragraph>
    <Paragraph position="3"> where u1 and h0 are as defined in section 3.</Paragraph>
    <Paragraph position="4">  2. Convert each ^wm to a corresponding character string ^xm.</Paragraph>
    <Paragraph position="5"> 3. Output ^x = argmaxm B(^xm;h;s), or the empty string if all B(^xm;h;s) are nonpositive. null  predictions of maximum word sequence length M, on a 1.2GHz processor, for the MEMD model. In all experiments reported below, M was set to a maximum of 5 to allow for convenient testing. Step 1 is carried out using a Viterbi beam search. To speed this up, the search is limited to an active vocabulary of target words likely to appear in translations of s, defined as the set of all words connected by some word-pair feature in our translation model to some word in s. Step 2 is a trivial deterministic procedure that mainly involves deciding whether or not to introduce blanks between adjacent words (eg yes in the case of la + vie, no in the case of l' + an). This also removes the prefix u1 from the proposal. Step 3 involves a straightforward evaluation of m strings according to equation (2).</Paragraph>
    <Paragraph position="6"> Table 1 shows empirical search timings for various values of M, for the MEMD model described in the next section. Times for the linear model are similar. Although the maximum times shown would cause perceptible delays for M &gt; 1, these occur very rarely, and in practice typing is usually not noticeably impeded when using the TransType interface, even at M = 5.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML