<?xml version="1.0" standalone="yes"?>
<Paper uid="J03-3004">
  <Title>wEBMT: Developing and Validating an Example-Based Machine Translation System Using the World Wide Web</Title>
  <Section position="2" start_page="0" end_page="437" type="abstr">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> In quite a short space of time, translation memory (TM) systems have become a very useful tool in the translator's armory. TM systems store a set of &lt;source, target&gt; translation pairs in their databases. If a new input string cannot be found exactly in the translation database, a search is conducted for close (or &amp;quot;fuzzy&amp;quot;) matches of the input string, and these are retrieved together with their translations for the translator to manipulate into the final, output translation. From this description, it should be clear that TM systems do not translate: Indeed, some researchers consider them to be little more than a search-and-replace engine, albeit a rather sophisticated one (Macklovitch and Russell 2000).</Paragraph>
    <Paragraph position="1"> We can illustrate this with respect to the TM entries in (1), taken from the Canadian Hansards: (1) a. While most were critical, some contributions were thoughtful and constructive == La plupart ont formul 'e des critiques, mais certains ont fait des observations r 'efl'echies et constructives.</Paragraph>
    <Paragraph position="2"> b. Others were plain meanspirited and some contained errors of fact == D'autres discours comportaient des propos mesquins et m ^eme des erreurs de fait.</Paragraph>
    <Paragraph position="3">  [?] School of Computing, Dublin 9, Ireland. E-mail: away@computing.dcu.ie + School of Computing, Dublin 9, Ireland. E-mail: ngough@computing.dcu.ie  Computational Linguistics Volume 29, Number 3 Consider the new source string in (2): (2) While most were critical, some contributions were plain meanspirited. Despite the fact that this new input in (2) is extremely close to the source strings in the TM entries in (1), no TM system containing just these translation pairs in its database would be able to translate (2); the best they could do would be to identify one or both of the two source sentences in the TM in (1) as fuzzy matches and display these, together with their French translations. The translator would then manipulate the target strings in the TM into the final translation (3):  (3) La plupart ont formul'e des critiques, mais certains ont fait des observations mesquines.</Paragraph>
    <Paragraph position="4"> An alternative translation that might be derived from the TM entries in (1) is that in (4): (4) La plupart ont formul'e des critiques, mais certains comportaient des observations mesquines.</Paragraph>
    <Paragraph position="5">  At all stages in the translation process, therefore, the translators themselves are the integral figures: They are free to accept or reject any suggested matches, they construct the translations, and they may or may not use any translations proposed by the TM system to formulate the translations in the target document. Finally, they are free to insert the translations produced into the TM itself as they see fit: that is, either (3) or (4) could be inserted into the TM with the source string (2), or some other translation, if that were preferred.</Paragraph>
    <Paragraph position="6"> A prerequisite for TM (and example-based machine translation [EBMT]) applications is a parallel corpus aligned at sentential level. Such a corpus may be presented to translators en bloc, or translators may help construct it themselves. Here too the translator maintains a large degree of autonomy: Using a tool such as Trados WinAlign, for example, he or she may manually overwrite some of the aligner's decisions by linking &lt;source, target&gt; sentence pairs using the graphical interface provided. Nevertheless, TM systems are currently falling far short of their potential, given the limitation that the smallest accessible translation units are &lt;source, target&gt; strings aligned only at sentential level. Consider the fuzzy matching operation, for instance: Translators are able to set a fuzzy match threshold below which no translation pairs are proposed by the TM system. If this threshold is set too low, then potentially useful translation pairs will be presented along with a lot of noise, thereby risking that this useful translation information will be obscured (high recall, low precision); if it is set too low, then good matches will be presented, but potentially useful matches will not be (low recall, high precision). We noted above that faced with the new input in (2), a TM system might be able to present the translator with the fuzzy matches in (1). However, if a translator were to set the level of fuzzy matching at 80% (a not unreasonable level), then neither of the translation pairs in (1) would be deemed to be a suitably good fuzzy match, as only 7/9 (77%) of the words in (1a) match those in (2) exactly, and only 3/9 (33%) of the words in (1b) match those in (2) exactly. Indeed, setting an appropriate fuzzy match level is such a difficult problem that some translators switch off this option and use the TM only to find exact matches. If subsentential alignment could be integrated into the TM databases, more useful fragments could be put at the disposal of the translator. If we could fragment the  Way and Gough wEBMT sententially aligned TM examples in (1) so that subsentential chunks were displayed to the user, then the chance of finding exact matches or good fuzzy matches would increase considerably. This is currently beyond the scope of TM systems.</Paragraph>
    <Paragraph position="7"> In contrast, EBMT systems have overcome this constraint by storing subsentential translational correspondences in addition to the sententially aligned pairs from which they are derived. As a consequence, where a TM system can only propose a number of close-scoring matches in its database for the translator to adapt into the final translation, an EBMT system can produce translations itself by automatically combining chunks from different translation examples stored in its memories.</Paragraph>
    <Paragraph position="8"> In Section 2, we describe how we automatically obtain a hierarchy of lexical resources that are used sequentially by our EBMT system, wEBMT, to translate new input. The primary resource gathered is a &amp;quot;phrasal lexicon,&amp;quot; constructed by extracting over 200,000 phrases from the Penn Treebank and having them translated into French by three Web-based machine translation (MT) systems.</Paragraph>
    <Paragraph position="9"> Each set of translations is stored separately, and for each set the &amp;quot;marker hypothesis&amp;quot; (Green 1979) is used to segment the phrasal lexicon into a &amp;quot;marker lexicon.&amp;quot; The marker hypothesis is a universal psycholinguistic constraint which states that natural languages are &amp;quot;marked&amp;quot; for complex syntactic structure at surface form by a closed set of specific lexemes and morphemes. That is, a basic phrase-level segmentation of an input sentence can be achieved by exploiting a closed list of known marker words to signal the start and end of each segment.</Paragraph>
    <Paragraph position="10"> Consider the following example, selected at random from the Wall Street Journal section of the Penn-II Treebank: (5) The Dearborn, Mich., energy company stopped paying a dividend in the third quarter of 1984 because of troubles at its Midland nuclear plant.</Paragraph>
    <Paragraph position="11"> Here we see that three noun phrases start with determiners and one with a possessive pronoun. The sets of determiners and possessive pronouns are both very small. Furthermore, there are four prepositional phrases, and the set of prepositions is similarly small. A further assumption that could be made is that all words that end with -ed are verbs, such as stopped in (5). The marker hypothesis is arguably universal in presuming that concepts and structures like these have similar morphological or structural marking in all languages.</Paragraph>
    <Paragraph position="12"> The marker hypothesis has been used for a number of different language-related tasks, including  and Hearne 2002) With respect to translation, a potential problem in using the marker hypothesis is that some languages do not have marker words such as articles, for instance. Green's (1979) work showed that artificial languages, both with and without specific marker words, may be learned more accurately and quickly if such psycholinguistic cues exist. The  Computational Linguistics Volume 29, Number 3 research of Mori and Moeser (1983) showed a similar effect due to case marking on pseudowords in such artificial languages, and Morgan, Meier, and Newport (1989) demonstrated that languages that do not permit pronouns as substitutes for phrases also provide evidence in favor of the marker hypothesis. Juola's (1994, 1998) work on grammar optimization and induction shows that context-free grammars can be converted to &amp;quot;marker-normal form.&amp;quot; However, marker-normal form grammars cannot capture the sorts of regularities demonstrated for languages that do not have a one-to-one mapping between a terminal symbol and a word. Nevertheless, Juola (1998, page 23) observes that &amp;quot;a slightly more general mapping, where two adjacent terminal symbols can be merged into a single lexical item (for example, a word and its case-marking), can capture this sort of result quite handily.&amp;quot; Work using the marker hypothesis for MT adapts this monolingual mapping for pairs of languages: It is reasonably straightforward to map an English determiner-noun sequence onto a Japanese noun-case marker segment, once one has identified the sets of marker tags in the languages to be translated.</Paragraph>
    <Paragraph position="13"> Following construction of the marker lexicon, the &lt;source, target&gt; chunks are generalized further using a methodology based on Block (2000) to permit a limited form of insertion in the translation process. As a byproduct of the chosen methodology, we also derive a standard &amp;quot;word-level&amp;quot; translation lexicon. These various resources render the set of original translation pairs far more useful in deriving translations of previously unseen input.</Paragraph>
    <Paragraph position="14"> In Section 3, we describe in detail the segmentation process, together with the procedure whereby target chunks are combined to produce candidate translations. In Section 4, we report initially on two experiments in which we test different versions of our EBMT system against test sets of NPs and sentences. We then conduct a set of further experiments which show that using the resources developed from more than one on-line MT system may improve both translation coverage and quality. Furthermore, seeding the system databases with more fragments improves translation quality. In addition, we calculate the net gain of our EBMT system by comparing translation quality against that of the three on-line MT systems. Finally, we comment on the relative strengths and weaknesses of the three on-line MT systems used.</Paragraph>
    <Paragraph position="15"> Like most EBMT systems, our approach suffers from the problem of &amp;quot;boundary friction&amp;quot;: where chunks from different translation examples are recombined, the quality of the resulting translations may be compromised. Assume that the aligned examples in (6) are located in the system database:  (6) a. You can attach a phone to the connector == Vous pouvez r 'elier un t'el 'ephone au connecteur.</Paragraph>
    <Paragraph position="16"> b. Connect only the keyboard and a mouse == Connectez uniquement le clavier et une souris.</Paragraph>
    <Paragraph position="17"> Let us now confront the EBMT system with the new input string in (7): (7) You can attach a mouse to the connector.</Paragraph>
    <Paragraph position="18"> This could be correctly translated by the EBMT system by isolating the useful translation fragments in (8): (8) a. You can attach == Vous pouvez r 'elier (from (6a)) b. a mouse == une souris (from (6b)) c. to the connector == au connecteur (from (6a))  Way and Gough wEBMT Recombining the French chunks gives us the correct translation in (9): (9) Vous pouvez r'elier une souris au connecteur.</Paragraph>
    <Paragraph position="19"> However, a number of mistranslations could also ensue, including those in (10): (10) a. *Vous pouvez r'elier un souris au connecteur.</Paragraph>
    <Paragraph position="20"> b. *Vous pouvez r'elier un souris au le connecteur.</Paragraph>
    <Paragraph position="21"> The mistranslation (10a) could be formed via the set of inferences in (11): (11) You can attach a == Vous pouvez r 'elier un (from (6a)) mouse == souris (from (6b)) to the connector == au connecteur (from (6a)) The mistranslation (10b) could be formed via the set of inferences in (12): (12) You can attach a == Vous pouvez r 'elier un (from (6a)) mouse == souris (from (6b)) to == au (from (6a)) the == le (from (6b)) connector == connecteur (from (6a)) It is clear, therefore, that unless the process by which the original &lt;source, target&gt; sentence pairs are fragmented is well defined and strictly controlled, chunks may be combined from different contexts that result in agreement errors such as those in (10).</Paragraph>
    <Paragraph position="22">  Depending on the input string, our wEBMT system may generate thousands of candidate translations, including many mistranslations like those in (10). A major advantage of MT systems based on probabilities is that output translations can be ranked (and pruned, if required): One would hope that such systems would rank good translations such as that in (9) more highly than poor ones such as those in (10). We demonstrate that in almost all experiments, our EBMT system consistently ranks the &amp;quot;best&amp;quot; translation in the top 10 output translations, and always in the top 1% of the translations generated.</Paragraph>
    <Paragraph position="23"> In order to minimize errors of boundary friction, in Section 5 we develop a novel, post hoc procedure via the World Wide Web to validate and, if necessary, correct translations prior to their being output to the user.</Paragraph>
    <Paragraph position="24">  Finally we conclude and point to areas of future research.</Paragraph>
    <Paragraph position="25"> 1 Note also that with respect to the translations given in (3) and (4), the translator interacting with the TM has used his or her translation knowledge to avoid a problem of boundary friction: Given the TM entries in (1), the translation of plain meanspirited would appear to be mesquins. This is correct in this context, as it co-occurs with a masculine plural noun propos. In translating (2), however, observations is a feminine plural noun, so the adjective mesquines is inserted to maintain agreement throughout the NP. If the translation pair &lt;plain meanspirited, mesquines&gt; were not found in the system's memories, then only the mistranslation observations mesquins could be produced by an EBMT system. 2 One of the areas of boundary friction that we use our post hoc validation procedure to correct is that of subject-verb agreement. Note that with examples such as (18), this is not usually (such) a problem for marker-based approaches to MT as we face here, as verbs are contained within (part of) the same chunk as their subject NPs. However, given that we translate phrases rather than sentences, it is a considerable problem for our approach, yet one that we overcome satisfactorily. In further work, if we were to store the translations of the VPs with their dummy subject NPs in a sentential lexicon and derive all marker lexicons from this database, the problem of subject-verb agreement would be largely overcome.</Paragraph>
    <Paragraph position="26">  2. Deriving Translation Resources from Web-Based MT Systems All EBMT systems, from the initial proposal by Nagao (1984) to the recent collection of Carl and Way (2003), are premised on the availability of subsentential alignments  derived from the input bitext. There is a wealth of literature on trying to establish sub-sentential translations from a bilingual corpus.</Paragraph>
    <Paragraph position="27">  Kay and R&amp;quot;oscheisen (1993) attempt to extract a bilingual dictionary using a hybrid method of sentence and word alignment on the assumption that the &lt;source, target&gt; words have a similar distribution. Fung and McKeown (1997) attempt to translate technical terms using word relation matrices, although the resource from which such relations are derived is a pair of nonparallel corpora. Somers (1998) replicates the work of Fung and McKeown with different language pairs using the simpler metric of Levenshtein distance. Boutsis and Piperidis (1998) use a tagged parallel corpus to extract translationally equivalent English-Greek clauses on the basis of word occurrence and co-occurrence probabilities. The respective lengths of the putative alignments in terms of characters is also an important factor. Ahrenberg, Andersson, and Merkel (2002) observe that for less widely spoken languages, the relative lack of linguistic tools and resources has forced developers of word alignment tools for such languages to use shallow processing and basic statistical approaches to word linking. Accordingly, they generate lexical correspondences by means of co-occurrence measures and string similarity metrics.</Paragraph>
    <Paragraph position="28"> More specifically, the notion of the phrasal lexicon (used first by Becker 1975) has been used successfully in a number of areas:  More recently, Simard and Langlais (2001) have proposed the exploitation of TMs at a subsentential level, while Carl, Way, and Sch&amp;quot;aler (2002) and Sch&amp;quot;aler, Way, and Carl (2003, pages 108-109) describe how phrasal lexicons might come to occupy a central place in a future hybrid integrated translation environment. This, they suggest, may result in a paradigm shift from TM to EBMT via the phrasal lexicon: Translators are on the whole wary of MT technology, but once subsentential alignment is enabled, translators will become aware of the benefits to be gained from &lt;source, target&gt; phrasal segments, and from there they suggest that &amp;quot;it is a reasonably short step to enabling an automated solution via the recombination element of EBMT systems such as those described in [Carl and Way 2003].&amp;quot; In this section, we describe how the memory of our EBMT system is seeded with a set of translations obtained from Web-based MT systems. From this initial resource, we subsequently derive a number of different databases that together allow many new input sentences to be translated that it would not be possible to translate in other systems. First, the phrasal lexicon is segmented using the marker hypothesis to produce a marker lexicon. This is then generalized, following a methodology based on Block (2000), to generate the &amp;quot;generalized marker lexicon.&amp;quot; Finally, as a result of the</Paragraph>
    <Section position="1" start_page="427" end_page="427" type="sub_section">
      <SectionTitle>
</SectionTitle>
      <Paragraph position="0"> methodology chosen, we automatically derive a fourth resource, namely, a &amp;quot;word-level lexicon.&amp;quot;</Paragraph>
    </Section>
    <Section position="2" start_page="427" end_page="428" type="sub_section">
      <SectionTitle>
2.1 The Phrasal Lexicon
</SectionTitle>
      <Paragraph position="0"> Our phrasal lexicon was built by selecting a set of 218,697 English noun phrases and verb phrases from the Penn Treebank. We identified all rule types occurring 1,000 or more times and eliminated those that were not relevant (e.g., rules dealing only with numbers). Where rules contained just a single nonterminal on their right-hand side, only those rules whose left-hand side was VP were retained in order to ensure that we could handle intransitive verbs. In total, 59 rule types out of a total of over 29,000 (i.e., just 0.002% of the rules in Penn-II) were used in creating the various lexical resources. For each of these 59 rule types, the tokens corresponding to the rule right-hand sides were extracted. These extracted English phrases were then translated using</Paragraph>
      <Paragraph position="2"> Translating the NPs via these MT systems was reasonably straightforward. We report in Section 4 on the quality of the French NPs produced, and in Section 5 we discuss experiments designed to discover whether our EBMT system could improve on any mistranslations obtained. Translating the VPs involved a little more thought: In the main, on-line MT systems such as these work far better when they translate sentences. In order to obtain finite verb forms rather than the default infinitival forms, we provided dummy subjects. Initially these were third-person plural pronouns, which caused similar verb forms to be created. This obviously biases the EBMT system more in favor of third-person plural sentences. Nevertheless, using the WWW-based post hoc evaluation methodology proposed in Section 5, we were still able to obtain reasonable translations for non-third-person-plural sentences too. In a subsequent experiment, we seed the databases of wEBMT with third-person singular verb forms by providing third-person singular pronouns as the dummy subjects, and in a final experiment we combine all third-person fragments (both singular and plural) into the system's memories and compare results on the same test set.</Paragraph>
      <Paragraph position="3"> The on-line MT systems were selected purely because they enable batch translation of large quantities of text. In our experience, the most efficient way to translate large amounts of data via on-line MT systems is to send each document as an HTML page with the phrases to be translated encoded as an ordered list. We automatically tagged the English phrases with HTML codes and input them into each translation system using the Unix wget function, which takes a URL as input and writes the corresponding HTML document to a file. If the URL takes the form of a query, then the document retrieved is the result of the query, namely, the translated Web page. Once this is obtained, it is a simple process to retrieve the French translations and associate them with their English source equivalents.</Paragraph>
    </Section>
    <Section position="3" start_page="428" end_page="429" type="sub_section">
      <SectionTitle>
2.2 The Marker Lexicons
</SectionTitle>
      <Paragraph position="0"> Given that the marker hypothesis is arguably universal, it is clear that benefits may accrue by using it to facilitate subsentential alignment of &lt;source, target&gt; chunks. Juola (1994, 1997) conducts some small experiments using his METLA system to show the viability of this approach for English [?]- French and English [?]- Urdu. For the English [?]- French language pair, Juola gives results of 61% correct translation when the system is tested on the training corpus, and 36% accuracy when it is evaluated with test data. For English [?]- Urdu, Juola (1997, page 213) notes that &amp;quot;the system learned the original training corpus ...perfectly and could reproduce it without errors&amp;quot;; that is, it scored 100% accuracy when tested against the training corpus. On novel test sentences, he gives results of 72% correct translation. In their Gaijin system, Veale and Way (1997) give a result of 63% accurate translations obtained for English [?]- German on a test set of 791 sentences from CorelDRAW manuals.</Paragraph>
      <Paragraph position="1"> As in METLA and Gaijin, we exploit lists of known marker words for each language to indicate the start and end of segments. For English, our source language, we use the sets of marker words in (13):  (13) &lt;DET&gt; {the, a, an, those, these, ...} &lt;PREP&gt; {in, on, out, with, from, to, under, ...} &lt;QUANT&gt; {all, some, few, many, ...} &lt;CONJ&gt; {and, or, ...} &lt;POSS&gt; {my, your, our,...} &lt;PRON&gt; {I, you, he, she, it,...} A similar set (14) was produced for French, the target language in our wEBMT system: (14) &lt;DET&gt; {le, la, l', les, ce, ces, ceux, cet, ...} &lt;PREP&gt; {dans, sur, avec, de, `a, sous, ...} &lt;QUANT&gt; {tous, tout, toutes, certain, quelques, beaucoup, ...} &lt;CONJ&gt; {et, ou, ...} &lt;POSS&gt; {mon, ma, mes, ton, ta, tes, notre, nos, ...} &lt;PRON&gt; {je, j', tu, il, elle, ...}  In a preprocessing stage, the aligned &lt;source, target&gt; pairs in the phrasal lexicon are traversed word by word, and whenever any such marker word is encountered, a new chunk is begun, with the first word labeled with its marker category (&lt;DET&gt;, &lt;PREP&gt;, etc.). The example in (15) illustrates the results of running the marker hypothesis over the source phrase all uses of asbestos:  In addition, we impose a further constraint that each chunk must also contain at least one non-marker word, so that the phrase out in the cold will be viewed as one segment (labeled with &lt;PREP&gt;), rather than split into still smaller chunks.</Paragraph>
      <Paragraph position="3"> &gt; pair, where X is one of the sets of translations derived from the three separate MT on-line systems (see above), we derive separate marker lexicons for each of the 218,697 source phrases and target translations. This gives</Paragraph>
    </Section>
    <Section position="4" start_page="429" end_page="431" type="sub_section">
      <SectionTitle>
</SectionTitle>
      <Paragraph position="0"> us a total of 656,091 &lt;source, target&gt; translation pairs (including many repetitions, of course). Given that English and French have essentially the same word order, these marker lexicons are predicated on the na&amp;quot;ive yet effective assumption that marker-headed chunks in the source S map sequentially to their target equivalents T; that is,</Paragraph>
      <Paragraph position="2"> marker categories matching, where possible. Using the previous example of all uses of asbestos, this gives us the marker chunks in (16):  Sometimes the number of marker chunks in the two languages differs, with respect to both the marker categories and the number of chunks obtained. Consider the example in (17): (17) The man looks at the woman == L'homme regarde la femme. Once the marker hypothesis is applied to (17), it would be marked up as in (18): (18) &lt;DET&gt; The man looks &lt;PREP&gt; at &lt;DET&gt; the woman == &lt;DET&gt; L' homme regarde &lt;DET&gt; la femme.</Paragraph>
      <Paragraph position="3"> That is, the English verb subcategorizes for a PP complement which in this case contains two marker words, whereas the French verb regarder is a straightforward transitive verb. It may appear, therefore, that there are three chunks in the English string and only two on the French side, but this is not the case: The restriction that each segment must contain at least one non-marker word ensures that we have just two marker chunks for the English string in (18). However, it remains the case that the chunks are tagged differently; we obtain the marker chunks in (19):  (19) English: &lt;DET&gt; The man looks &lt;PREP&gt; at &lt;DET&gt; the woman French: &lt;DET&gt; L' homme regarde &lt;DET&gt; la femme  Our alignment method would therefore align the first English chunk with the first French chunk, as their marker categories match. Note, of course, that this contains a translation error: regarde translates not as looks but rather as looks at. Errors such as this will adversely affect translation quality, but as we report in Section 4, good-quality translations are obtainable on the whole. The second pair in (19), however, cannot be mapped straightforwardly onto one another, as the marker categories differ. Nevertheless, our algorithm would align &amp;quot;&lt;DET&gt; the woman&amp;quot; with &amp;quot;&lt;DET&gt; la femme,&amp;quot; as their marker categories match. This ensures that as many potentially useful translation fragments are generated as possible.</Paragraph>
      <Paragraph position="4"> This na&amp;quot;ive alignment procedure works well between (broadly) similar languages such as English and French, but there are cases even between quite closely related languages in which the procedure breaks down. In order to increase translation quality  Computational Linguistics Volume 29, Number 3 still further, the mapping function needs to be improved to account for examples such as (20): (20) The man likes the woman == La femme pla^it `a l'homme.</Paragraph>
      <Paragraph position="5"> The like == plaire case is an argument-switching (or relation-changing) example, in that the subject in English becomes the indirect object in French, and the English object translates as the French subject. If we were to apply the marker hypothesis to (20), we would derive (21):  (21) &lt;DET&gt; The man likes &lt;DET&gt; the woman == &lt;DET&gt; La femme pla^it &lt;PREP&gt; `a &lt;DET&gt; l' homme.</Paragraph>
      <Paragraph position="6"> That is, without recourse to a lexicon or information about the relative distribution of words and their translations, we would derive the marker chunks in (22): (22) a. &lt;DET&gt; The man likes == &lt;DET&gt; La femme pla^it b. &lt;DET&gt; the woman == &lt;DET&gt; l' homme  Of course, both alignments are wrong. However, our alignment method correctly aligns &lt;source, target&gt; segments in approximately 80% of cases. We calculate this as an approximation by testing all translations of marker chunks to see whether these French chunks appear anywhere on the Web: If so, we assume that the translations obtained by the online MT systems are correct. For 39,895 such translations, 75.2% of those produced by system A appear on the Web, with 81.7% of those generated by system B and 81.5% of those produced by system C also appearing on the Web. Note that this gives us only an approximation of the correctness of our alignments, as we are testing whether the French translations are &amp;quot;good French&amp;quot; rather than whether the alignments in which they appear are actually correct.</Paragraph>
      <Paragraph position="7"> Correcting misalignments such as those in (22) is a topic for further research. Adding a bilingual lexicon (our word-level lexicon, for example) and incorporating the constraints contained therein into the marker-based alignment process would prevent chunks such as those in (22) from being generated, and we conjecture that translation quality would improve accordingly.</Paragraph>
      <Paragraph position="8"> Given marker chunks such as those in (16), we are able to extract automatically a further bilingual dictionary, the word-level lexicon. We take advantage of the assumption that where a chunk contains just one non-marker word in both source and target, these words are translations of each other. Where a marker-headed pair contains just two words, as in (16), for instance, we can extract the word-level translations in (23):  That is, using the marker hypothesis method of segmentation, smaller aligned segments can be extracted from the phrasal lexicon without recourse to any detailed parsing techniques or complex co-ocurrence measures.</Paragraph>
    </Section>
    <Section position="5" start_page="431" end_page="432" type="sub_section">
      <SectionTitle>
</SectionTitle>
      <Paragraph position="0"> Juola (1994, 1997) assumes that words ending in -ed are verbs. However, given that verbs are not a closed class, in our approach we do not mark chunks beginning with a verb with any marker category. Instead, we take advantage of the fact that the initial phrasal chunks correspond to rule right-hand sides. That is, for a rule in the Penn Treebank VP [?]- VBG, NP, PP, we are certain (if the annotators have done their job correctly) that the first word in each of the strings corresponding to this right-hand side is a VBG, that is, a present participle. Given this information, in such cases we tag such words with the &lt;LEX&gt; tag. Taking expanding the board to 14 members [?]- augmente le conseil `a 14 membres as an example, we extract the chunks in (24):  We ignore here the trivially true lexical chunk &amp;quot;&lt;QUANT&gt; 14 : 14.&amp;quot; In a final processing stage, we generalize over the marker lexicon following a process found in Block (2000). In Block's approach, word alignments are assigned probabilities by means of a statistical word alignment tool. In a subsequent stage, chunk pairs are extracted, which are then generalized to produce a set of translation templates for each &lt;source, target&gt; segment.</Paragraph>
      <Paragraph position="1"> Block distinguishes chunks from &amp;quot;patterns,&amp;quot; as we do: His chunks are similar to our marker chunks, and his patterns are similar to our generalized marker chunks.</Paragraph>
      <Paragraph position="2"> Once chunks are derived from &lt;source, target&gt; alignments, patterns are computed from the derived chunks by means of the following algorithm: &amp;quot;for each pair of chunk pairs  replaced by V&amp;quot; (Block 2000, pages 414-415). Block then gives an example that shows how patterns are derived. Assume the chunk pairs in (25):  (25) &lt; [das], [which] &gt; &lt; [ist], [is] &gt; &lt; [was], [what] &gt; &lt; [Sie], [you] &gt; &lt; [wollten], [wanted] &gt; &lt; [das ist], [which is] &gt;  &lt; [das ist was], [which is what] &gt; &lt; [das ist was Sie], [which is what you] &gt; &lt; [das ist was Sie wollten], [which is what you wanted] &gt;  Computational Linguistics Volume 29, Number 3 Using the algorithm described above, the patterns in (26) are derived from the chunks in (25): (26) &lt; [V ist], [V is] &gt; &lt; [das V], [which V] &gt; &lt; [das V was], [which V what] &gt;</Paragraph>
      <Paragraph position="4"> Of course, many other researchers also try to extract generalized templates. Kaji, Kida, and Morimoto (1992) identify translationally equivalent phrasal segments and replace such equivalents with variables to generate a set of translation patterns. Watanabe (1993) combines lexical and dependency mappings to form his generalizations.</Paragraph>
      <Paragraph position="5"> Other similar approaches include those of Cicekli and G &amp;quot;uvenir (1996), McTait and Trujillo (1999), Carl (1999), and Brown (2000), inter alia.</Paragraph>
      <Paragraph position="6"> In our system, in some cases the smallest chunk obtainable via the marker-based segmentation process may be something like (27): (27) &lt;DET&gt; the good man : le bon homme In such cases, if our system were confronted with a good man, it would not be able to translate such a phrase, assuming this to be missing from the marker lexicon. Accordingly, we convert examples such as (27) into their generalized equivalents, as in (28): (28) &lt;DET&gt; good man : bon homme That is, where Block (2000) substitutes variables for various words in his templates, we replace certain lexical items with their marker tag. Given that examples such as ''&lt;DET&gt; a : un&amp;quot; are likely to exist in the word-level lexicon, they may be inserted at the point indicated by the marker tag to form the correct translation un bon homme.We thus cluster on marker words to improve the coverage of our system (see Section 5 for results that show exactly how clustering on marker words helps); others (notably Brown [2000, 2003]) use clustering techniques to determine equivalence classes of individual words that can occur in the same context, and in so doing derive translation templates from individual translation examples.</Paragraph>
    </Section>
    <Section position="6" start_page="432" end_page="434" type="sub_section">
      <SectionTitle>
2.3 Summary
</SectionTitle>
      <Paragraph position="0"> In sum, we automatically create four knowledge sources:  Way and Gough wEBMT When matching the input to the corpus, we search for chunks in the order given here, that is, from specific examples (those containing more context) to generic (those containing less context). We give in (29) an example of how a particular sentence from our test set is translated via these different knowledge sources: (29) Input: A major concern for the parent company is what advertisers are paying per page.</Paragraph>
      <Paragraph position="1"> Chunks found in marker lexicon: for the parent company : pour la soci'et'em`ere what advertisers are paying per page : quels annonceurs paient per page Chunk found in generalized marker lexicon:  Given the fragments shown in (29), a translation can now be derived. First, the word pair &amp;quot;&lt;DET&gt; a : une&amp;quot; is inserted into the generalized template &amp;quot;&lt;DET&gt; major concern: inqui'etude majeure&amp;quot; to begin the translation process; the next chunk, &amp;quot;for the parent company : pour la soci'et'em`ere,&amp;quot; is retrieved from the marker lexicon; the missing word pair &amp;quot;&lt;LEX&gt; is : est&amp;quot; is retrieved from the word-level lexicon; and finally, the marker chunk &amp;quot;what advertisers are paying per page : quels annonceurs paient per page&amp;quot; is appended to produce the translation in (30): (30) Une inqui'etude majeure pour la soci'et'em`ere est quels annonceurs paient per page.</Paragraph>
      <Paragraph position="2"> Of course, this &amp;quot;translation&amp;quot; is not without problems: There is a poor (in this instance) translation of what as quels, and a nontranslation of per. There is little our system can do about errors such as these made by the on-line MT systems. Nevertheless, (29) illustrates how the various knowledge sources play a part in determining the final translation in our system.</Paragraph>
      <Paragraph position="3"> Note that none of these aligned resources would be possible in a TM system. The problem of segmentation is not an inconsiderable one in all EBMT systems, but we (and others) have found that using the marker hypothesis can greatly facilitate such a process. We shall show in subsequent sections that because such knowledge sources are derived automatically from the original translations obtained via Web-based MT systems, the translations obtained in our EBMT process are largely of high quality, are ranked highly in the set of output translation candidates, and may be generated in almost all cases--all this despite the fact that the original translations obtained via the Web contain many errors, and that the source phrases to be translated were selected from a mere fraction of the rule types in the Penn-II Treebank.</Paragraph>
      <Paragraph position="4"> 3. Retrieving Chunks and Producing Translations In Section 4, we report on a number of experiments using the resources obtained in the previous section to translate two test sets of data, one a set of NPs and the other  Computational Linguistics Volume 29, Number 3 a set of sentences. Although we are primarily interested in translating sentences, we translate NPs for two reasons: (1) to assure ourselves that we are in fact translating nominal chunks correctly, and (2) to see whether our methodology can actually correct any NPs mistranslated by the three on-line MT systems. In this section, we describe the processes involved in retrieving appropriate chunks and forming translations for NPs only (these being fewer in number than for sentences, of course).</Paragraph>
    </Section>
    <Section position="7" start_page="434" end_page="434" type="sub_section">
      <SectionTitle>
3.1 Segmentation of the Input
</SectionTitle>
      <Paragraph position="0"> In many cases, a 100% match for a given NP cannot be found in the phrasal lexicon.</Paragraph>
      <Paragraph position="1"> In order to try and process the NP in a compositional manner, it is segmented into smaller chunks, and the system then attempts to locate these chunks individually and to retrieve their relevant translation(s) from the various lexicons described above. We use an n-gram-based segmentation method. Initially, we located all possible bigrams, trigrams and so on within the input string and then searched for these within the relevant knowledge sources.</Paragraph>
      <Paragraph position="2"> However, many of these n-grams cannot be found by our system, given that new chunks are placed in the marker lexicon when a marker word is found in a sentence.</Paragraph>
      <Paragraph position="3"> Taking the NP the total at risk a year as an example, chunks such as the total at risk a or at risk a cannot be located, as new chunks would be formed at each marker word (assuming the adjacent word is a non-marker word), so the best that could be expected here might be to find the chunks in (31): (31) &lt;DET&gt; the total, &lt;PREP&gt; at risk, &lt;DET&gt; a year The respective translations of these chunks would then be recombined to form the target string. In a recent addition to our work, we have eliminated certain n-grams (such as those that end in a marker word, for instance) from the search process, as these would never be found given our chosen method of segmentation.</Paragraph>
    </Section>
    <Section position="8" start_page="434" end_page="435" type="sub_section">
      <SectionTitle>
3.2 Retrieving Translation Chunks
</SectionTitle>
      <Paragraph position="0"> We use translations retrieved from the three different on-line MT systems specified above (see Section 2.1). These translations are further broken down using the marker hypothesis to provide us with an additional three knowledge sources A  , a marker lexicon, generalized marker lexicon and word-level lexicon derived from chunks produced by each system. These knowledge sources can be combined in several different ways. We have produced translations using  , that is, phrasal and marker lexicons derived from translations produced by all three on-line systems The objective here is to see how much translation coverage and quality are improved by using chunks derived from multiple sources. Assuming that the English strings are  Way and Gough wEBMT not translated in exactly the same manner by the three on-line MT systems means that more knowledge sources could be combined in attempting to translate the new input contained in the test sets of noun phrases and sentences. Results from experiments conducted using multiple knowledge sources are given in Section 4.2.</Paragraph>
    </Section>
    <Section position="9" start_page="435" end_page="437" type="sub_section">
      <SectionTitle>
3.3 Calculation of Weights
</SectionTitle>
      <Paragraph position="0"> Each time a source language (SL) chunk is submitted for translation, the appropriate target language (TL) chunks are retrieved and returned with a weight attached. We use a maximum of six knowledge sources: * Stage 1: Three sets of translations (A, B, and C) are retrieved using each of the three on-line MT systems.</Paragraph>
      <Paragraph position="1">  ) acquired by breaking down the translations retrieved in Stage 1 using the marker hypothesis to form the marker lexicon, the generalized marker lexicon, and the word-level lexicon.</Paragraph>
      <Paragraph position="2"> Within each knowledge source, each translation is weighted according to the formula in (32): (32) weight = number of occurrences of the proposed translation total number of translations produced for SL phrase For the SL phrase the house, assuming that la maison is found eight times and le domicile is found twice, then P(la maison  |the house) = 8/10 and P(le domicile  |the house) = 2/10. Note that since each SL phrase will only have one proposed translation within each of the knowledge sources acquired at Stage 1, these translations will always have a weight of 1.</Paragraph>
      <Paragraph position="3"> If we wish to consider only those translations produced using a single MT system (e.g., A and A prime ), we add the weights of translations found in both knowledge sources and divide the weights of all proposed translations by two. For the SL phrase the house, assuming P(la maison  |the house) = 5/10 in knowledge source A and P(la maison  |the house) = 8/10 in A prime , then P(la maison  |the house) = 13/20 over both knowledge sources. Similarly, if we wish to consider translations produced by all three MT systems, then we add the weights of common translations and divide the weights of all proposed translations by six.</Paragraph>
      <Paragraph position="4"> When translated phrases have been retrieved for each chunk of the input string, they must then be combined to produce an output string. In order to calculate a ranking for each TL sentence produced, we multiply the weights of each chunk used in its construction. Note that this ensures that greater importance is attributed to longer chunks, as is usual in most EBMT systems (cf. Sato and Nagao 1990; Veale and Way 1997; Carl 1999).</Paragraph>
      <Paragraph position="5">  As an example, consider the translation into French of the house collapsed. Assume the conditional probabilities in (33): 7 Note that approaches that prefer the greatest context to be taken into account are not limited to EBMT. Research in the area of data-oriented parsing (cf. Bod, Scha, and Sima'an, 2003) also shows that unless the corpus is inherently biased, derivations constructed using the smallest number of subtrees have a higher probability than those built with a larger number of smaller subtrees.  (33) a. P(la maison  |the house) = 8/10 b. P(le domicile  |the house) = 2/10 c. P(s''ecroula  |collapsed) = 1/7 d. P(s'effondra  |collapsed) = 6/7  Given the weights in (33), the four translations in (34) can be produced, each with an associated probability:  Where different derivations result in the same TL string, their weights are summed and the duplicate strings are removed.</Paragraph>
      <Paragraph position="6"> The examples in (33) and (34) are reasonably straightforward if we assume, as here, that the chunks in (35) exist in the system databases shown:  Given the aligned segments in (36), the correct translations (37) would be built: (37) a. Une maison s''ecroula.</Paragraph>
      <Paragraph position="7"> b. Une maison s'effondra.</Paragraph>
      <Paragraph position="8"> c. Un domicile s''ecroula.</Paragraph>
      <Paragraph position="9"> d. Un domicile s'effondra.</Paragraph>
      <Paragraph position="10"> However, in addition, the mistranslations in (38) would be constructed: (38) a. *Un maison s''ecroula.</Paragraph>
      <Paragraph position="11">  Way and Gough wEBMT b. *Un maison s'effondra.</Paragraph>
      <Paragraph position="12"> c. *Une domicile s''ecroula.</Paragraph>
      <Paragraph position="13"> d. *Une domicile s'effondra.</Paragraph>
      <Paragraph position="14">  These mistranslations are all caused by boundary friction. Each of the translations in (37) and (38) would be output with an associated weight and ranked by the system. We would like to incorporate into our model a procedure whereby translation chunks extracted from the phrasal and marker lexicons are more highly regarded than those constructed by inserting words from the word-level lexicon into generalized marker chunks. That is, we want to allocate a larger portion of the probability space to the phrasal and marker lexicons than to the generalized or word-level lexicons. We have yet to import such a constraint into our model, but we plan to do so in the near future using the weighted majority algorithm (Littlestone and Warmuth 1992).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>