<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0507">
  <Title>TransType: a Computer-Aided Translation Typing System</Title>
  <Section position="4" start_page="0" end_page="49" type="metho">
    <SectionTitle>
2 The TransType model
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="46" type="sub_section">
      <SectionTitle>
2.1 User Viewpoint
</SectionTitle>
      <Paragraph position="0"> Our interactive translation system is illustrated in figure 1 for an English to French translation.</Paragraph>
      <Paragraph position="1"> It works as follows: a translator selects a sentence and beg!ns typing its translation. After each character typed by the translator, the system displays a proposed completion, which may either be accepted using a special key or rejected by continuing to type. This interface is simple and its performance may be measured by the proportion of characters or keystrokes saved in typing a translation. Note that, throughout this process, the translator remains in control, and the machine must continually adapt its suggestions to the translator's input. This differs from the usual machine translation set-ups where it is the machine that produces the first draft which  * . * . :, * ,, ------~ ..... Fich:ier :::= ptions * :'. &amp;quot; - :.'. ... &amp;quot; ..: .... &amp;quot;.11... It&amp;quot; I: am:pleased:to: takepart:m this debate today. Usingitoday'S technologies,it:is possiblefOrall ~adiaqs to .... . a ..... borrowing:.</Paragraph>
      <Paragraph position="2"> ..... * ......... :::i~ : d~batl..</Paragraph>
      <Paragraph position="3"> GraCel ~i~la t~chnOIogiemoderne, tousles Can adiehs peuVent se  screen. The target text is typed in the bottom half with suggestions given by the menu at the insertion point.</Paragraph>
      <Paragraph position="4"> then has to be corrected by the translator.</Paragraph>
      <Paragraph position="5"> The first version of TRANSTYPE (Foster et al., 1997) only proposed completions for the current word. We are now working on predictions which extend to the next several words in the text. The potential gain from multiple-word predictions (Langlais et al., 2000) can be appreciated in the one-sentence translation task reported in table 1, where a hypothetical user saves over 60% of the keystrokes needed to produce a translation in a word completion scenario, and about 75% in a &amp;quot;unit&amp;quot; completion</Paragraph>
    </Section>
    <Section position="2" start_page="46" end_page="49" type="sub_section">
      <SectionTitle>
2.2 System Viewpoint
</SectionTitle>
      <Paragraph position="0"> The core of TRANSTYPE is a completion engine which comprises two main parts: an evaluator which assigns probabilistic scores to completion  This bill is very similar to its companion bill which we dealt with yesterday in the house of commons word-completion task. unit-completion task pref. completions pref. completions</Paragraph>
      <Paragraph position="2"> 106 char. 23 20 accept. 14 11 accept. -t- 1 correc.</Paragraph>
      <Paragraph position="3"> 43 keystrokes 26 keystrokes Table h A one-sentence session illustrating the word- and unit- completion tasks. The first column indicates the target words the user is expected to produce. The next two columns indicate respectively the prefixes typed by the user and the completions made by the system under a word-completion task. The last two columns provide the same information for the unit-completion task. The total number of keystrokes for both tasks is reported in the last line. + indicates the acceptance key typed by the user. A Completion is denoted by a/fl where a is the typed prefix and fl the completed part. Completions for different prefixes are separated by * .</Paragraph>
      <Paragraph position="4"> hypotheses and a generator which uses the evaluation function to select the best candidate for completion.</Paragraph>
      <Paragraph position="5">  The evaluator is a function p(t\[t', s) which assigns to each target-text unit t an estimate of its probability given a source text s and the tokens t' which precede t in the current translation of s. Our approach to modeling this distribution is based to a large extent on that of the IBM group (Brown et al., 1993), but it diflhrs in one significant aspect: whereas the IBM model involves a &amp;quot;noisy channel&amp;quot; decomposition, we use a linear combination of separate predictions from a language model p(t\[t') and a translation model p(t\[s). Although the noisy channel technique is powerful, it has the disadvantage that p(s\[t', t) is more expensive to compute than p(t\[s) when using IBM-style translation models.</Paragraph>
      <Paragraph position="6"> Since speed is crucial for our application, we chose to forego it in the work described here.</Paragraph>
      <Paragraph position="7"> Our linear combination model is fully described in (Langlais and Foster, 2000) but can be seen as follows:</Paragraph>
      <Paragraph position="9"> where .~(O(t',s)) e \[0,1\] are context-dependent interpolation coefficients. O(t~,s) stands for any function which maps t~,s into a set of equivalence classes. Intuitively, ),(O(t r, s)) should be high when s is more informative than t r and low otherwise. For example, the translation model could have a higher weight at the start of sentence but the contribution of the language model can become more important in the middle or the end of the sentence.</Paragraph>
      <Paragraph position="10">  We experimented with various simple linear combinations of four different French language models: a cache model, similar to the cache component in Kuhn's model (Kuhn and Mori, 1990); a unigram model; a trielass model (Derouault and Merialdo, 1986); and an interpolated trigram (Jelinek, 1990).</Paragraph>
      <Paragraph position="11"> We opted for the trigram, which gave significantly better results than the other three models. The trigram was trained on the Hansard corpus (about 50 million words), with 75% of the corpus used for relative-frequency parameter estimates, and 25% used to reestimate interpolation coefficients.</Paragraph>
      <Paragraph position="12">  Our translation model is based on the linear interpolation given in equation 2 which combines predictions of two translation models -- Ms and Mu -- both based on an IBM-like model 2 (see equation 3). Ms was trained on single words and Mu was trained on both words and units.</Paragraph>
      <Paragraph position="13"> p( tls) = Z pt( tls) ,+ (1 - Z).p2 ( (s ) ) word unit (2) where Ps and Pu stand for the probabilities given respectively by Ms and M~. ~(s) represents the new sequence of tokens obtained after grouping the tokens of s into units.</Paragraph>
      <Paragraph position="14"> Both models are based on IBM translation model 2 (Brown et al., 1993) which has the  property that it generates tokens independently. The total probability of the ith target-text token ti is just the average of the probabilities with which it is generated by each source text token sj; this is a weighted average that takes the distance from the generating token into account: null</Paragraph>
      <Paragraph position="16"> where p(ti Is j) is a word-for-word translation probability, Isl is the length (counted in tokens) ofthe source segment s under translation, and a(jli , Is\]) is the a priori alignment probability that the target-text token at position i will be generated by the source text token at position j; this is equal to a constant value of 1~(Is I + 1) for model 1. This formula follows the convention of (Brown et al., 1993) in letting so designate the null state. We modified IBM model 2 to account for invariant entities such as English forms that almost invariably translate into French either verbatim or after having undergone a predictable transformation e.g. numbers or dates. These forms are very frequent in the Hansard corpus.</Paragraph>
    </Section>
    <Section position="3" start_page="49" end_page="49" type="sub_section">
      <SectionTitle>
2.3 The Generator
</SectionTitle>
      <Paragraph position="0"> The task of the generator is to identify units matching the current prefix typed by the user, and pick the best candidate using the evaluation function. Given the real time constraints of an IMT system, we divided the French vocabulary into two parts: a small active component whose contents are always searched for a match to the current prefix, and a much larger passive part which comes into play only when no candidates are found in the active vocabulary. Both vocabularies are coded as tries.</Paragraph>
      <Paragraph position="1"> The passive vocabulary is a large dictionary containing over 380,000 word forms. The active part is computed dynamically when a new sentence is selected by the translator. It relies on the fact that a small number of words account for most of the tokens in a text. It is composed of a few entities (tokens and units) that are likely to appear in the translation. In practice, we found that keeping 500 words and 50 units yields good performance.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="49" end_page="49" type="metho">
    <SectionTitle>
3 Implementation
</SectionTitle>
    <Paragraph position="0"> From an implementation point of view, the core of TransType relies on a flexible object oriented architecture, which facilitates the integration of any model that can predict units (words or sequence of words) from what has been already typed and the source text being translated. This part is written in C/+. Statistical translation and language models have been integrated among others into this architecture (Lapalme et al., 2000).</Paragraph>
    <Paragraph position="1"> The graphical user interface is implemented in Tcl/Tk, a multi-platform script language well suited to interfacing problems. It offers all the classical functions for text edition plus a pop-up menu which contains the more probable words or sequences of words that may complete the ongoing translation. The proposed completions are updated after each keystroke the translator enters.</Paragraph>
  </Section>
class="xml-element"></Paper>