<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1020">
  <Title>User-Friendly Text Prediction for Translators</Title>
  <Section position="7" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
6 Evaluation
</SectionTitle>
    <Paragraph position="0"> We evaluated the predictor for English to French translation on a section of the Canadian Hansard corpus, after training the model on a chronologically earlier section. The test corpus consisted of 5,020 sentence pairs and approximately 100k words in each language; details of the training corpus are given in (Foster, 2000b).</Paragraph>
    <Paragraph position="1"> To simulate a translator's responses to predictions, we relied on the user model: accepting probabilistically according to p(a | x, h, s, k), determining the associated benefit using B(x, h, s, k, a), and advancing the cursor k characters in the case of an acceptance, 1 otherwise. (Caption of Table 2: Numbers give % reductions in keystrokes.)</Paragraph>
    <Paragraph position="2"> Here k was obtained by comparing x to the known x from the test corpus.</Paragraph>
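The simulation loop just described can be sketched as follows. This is a hypothetical reconstruction: `accept_prob` and `benefit` stand in for the paper's p(a | x, h, s, k) and B(x, h, s, k, a), whose actual forms depend on the user model and are not given here; the only concrete piece is computing k as the longest common prefix of the prediction and the reference text.

```python
import random

def common_prefix_len(prediction: str, reference: str) -> int:
    """k = length of the longest common prefix of the prediction
    and the known target text from the test corpus."""
    k = 0
    for a, b in zip(prediction, reference):
        if a != b:
            break
        k += 1
    return k

def simulate(predictions, reference, accept_prob, benefit, rng=random.random):
    """Replay a simulated translator over a sequence of predictions.

    accept_prob(x, k) plays the role of p(a | x, h, s, k);
    benefit(x, k, accepted) plays the role of B(x, h, s, k, a).
    Both are hypothetical stand-ins for the paper's user model.
    """
    cursor, total_benefit = 0, 0.0
    for x in predictions:
        k = common_prefix_len(x, reference[cursor:])
        accepted = rng() < accept_prob(x, k)
        total_benefit += benefit(x, k, accepted)
        # advance k characters on acceptance, 1 otherwise
        cursor += k if accepted else 1
        if cursor >= len(reference):
            break
    return total_benefit
```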
    <Paragraph position="3"> It may seem artificial to measure performance according to the objective function for the predictor, but this is biased only to the extent that it misrepresents an actual user's characteristics. There are two cases: either the user is a better candidate than assumed by the model (types more slowly, reacts more quickly and rationally), or a worse one. The predictor will not be optimized in either case, but the simulation will only overestimate the benefit in the second case. By being conservative in estimating the parameters of the user model, we feel we have minimized the number of translators who would fall into this category, and thus can hope to obtain realistic lower bounds for the average benefit across all translators.</Paragraph>
    <Paragraph position="4"> Table 2 contains results for two different translation models. The top portion corresponds to the MEMD2B maximum entropy model described in (Foster, 2000a); the bottom portion corresponds to the linear combination of a trigram and IBM 2 used in the TransType experiments (Langlais et al., 2002).</Paragraph>
    <Paragraph position="5"> Columns give the maximum permitted number of words in predictions. Rows show different predictor configurations: fixed ignores the user model and makes fixed M-word predictions; linear uses the linear character-probability estimates described in section 3.1; exact uses the exact character-probability calculation; corr is described below; and best gives an upper bound on performance by choosing m in step 3 of the search algorithm so as to maximize B(x, h, s, k) using the true value of k.</Paragraph>
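The oracle choice of m behind the best row can be sketched as below; `candidates_by_m`, `true_k_of`, and `benefit` are hypothetical stand-ins, since the paper's search algorithm and benefit function are not reproduced here.

```python
def oracle_best_m(candidates_by_m, true_k_of, benefit):
    """Upper bound on performance: among the candidate predictions for
    each length m, keep the one whose benefit, computed with the true
    value of k, is highest (hypothetical sketch)."""
    best_x, best_b = "", float("-inf")
    for m, x in candidates_by_m.items():
        b = benefit(x, true_k_of(x))
        if b > best_b:
            best_x, best_b = x, b
    return best_x, best_b
```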
    <Paragraph position="6"> Table 3 illustrates the effects of different components of the user model by showing results for simulated users who read infinitely fast and accept only predictions having positive benefit (superman); who read normally but accept like superman (rational); and who match the standard user model (real). For each simulation, the predictor optimized benefits for the corresponding user model.</Paragraph>
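The three simulated users of Table 3 differ along two axes, which might be encoded as follows; this is a hypothetical sketch, and the reading-time cost (which distinguishes superman from rational) would in practice be charged inside the benefit function rather than in the acceptance decision shown here.

```python
from dataclasses import dataclass

@dataclass
class SimulatedUser:
    """Hypothetical encoding of the three simulated users in Table 3."""
    reads_instantly: bool   # superman only: no reading-time cost
    rational_accept: bool   # accept exactly the positive-benefit predictions

def decide(user, benefit_if_accepted, accept_probability, rng):
    """Return True if this simulated user accepts the prediction."""
    if user.rational_accept:
        return benefit_if_accepted > 0        # superman / rational
    return rng() < accept_probability         # real: probabilistic acceptance

SUPERMAN = SimulatedUser(reads_instantly=True,  rational_accept=True)
RATIONAL = SimulatedUser(reads_instantly=False, rational_accept=True)
REAL     = SimulatedUser(reads_instantly=False, rational_accept=False)
```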
    <Paragraph position="7"> Several conclusions can be drawn from these results. First, it is clear that estimating expected benefit is a much better strategy than making fixed word-length proposals, since the latter causes an increase in time for all values of M. In general, making &quot;exact&quot; estimates of string prefix probabilities works better than a linear approximation, but the difference is fairly small.</Paragraph>
    <Paragraph position="8"> Second, the MEMD2B model significantly outperforms the trigram+IBM2 combination, producing better results for every predictor configuration tested. The figure of -11.5% in bold corresponds to the TransType configuration, and corroborates the validity of the simulation.[3] Third, there are large drops in benefit due to reading times and probabilistic acceptance. The biggest cost is due to reading, which lowers the best possible keystroke reduction by almost 50% for M = 5.</Paragraph>
    <Paragraph position="9"> Probabilistic acceptance causes a further drop of about 15% for M = 5.</Paragraph>
    <Paragraph position="10"> The main disappointment in these results is that performance peaks at M = 3 rather than continuing to improve as the predictor is allowed to consider longer word sequences. Since the predictor knows B(x, h, s, k), the most likely cause for this is that the estimates for p(ŵ_m | h, s) become worse with increasing m. Significantly, performance levels off at three words, just as the search loses direct contact with h through the trigram. [Footnote 3: Although the drop observed with real users was greater at about 20% (= 17% reduction in speed), there are many differences between experimental setups that could account for the discrepancy. For instance, part of the corpus used for the TransType trials was drawn from a different domain, which would adversely affect predictor performance.]</Paragraph>
    <Paragraph position="11"> To correct for this, we used modified probabilities obtained by scaling p(ŵ_m | h, s) with a length-specific correction factor, tuned so as to optimize benefit on a cross-validation corpus. The results are shown in the corr row of Table 2, for exact character-probability estimates. In this case, performance improves with M, reaching a maximum keystroke reduction of 12.6% at M = 5.</Paragraph>
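The length-specific correction could be sketched as below; all names here are hypothetical, the factors are stored as log-domain offsets per prediction length, and the tuning loop is only a simple greedy grid search standing in for whatever optimization procedure the paper actually used on its cross-validation corpus.

```python
def corrected_score(log_prob: float, m: int, correction: dict) -> float:
    """Scale the model's probability p(w_1..w_m | h, s) by a
    length-specific correction factor (stored here as a log offset)."""
    return log_prob + correction.get(m, 0.0)

def tune_corrections(benefit_on_dev, grid, max_m):
    """Greedy per-length tuning on a held-out corpus: for each
    prediction length m, keep the factor that maximizes simulated
    benefit (hypothetical stand-in for the paper's tuning procedure)."""
    correction = {}
    for m in range(1, max_m + 1):
        best = max(grid, key=lambda f: benefit_on_dev({**correction, m: f}))
        correction[m] = best
    return correction
```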
  </Section>
</Paper>