<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1026">
<Title>Toward Multi-Engine Machine Translation</Title>
<Section position="3" start_page="0" end_page="148" type="metho">
<SectionTitle> 2. INTEGRATING MULTI-ENGINE OUTPUT </SectionTitle>
<Paragraph position="0"> The MT configuration in our experiment used three MT engines:
* a knowledge-based MT (KBMT) system, the mainline Pangloss engine [1];
* an example-based MT (EBMT) system (see [2, 3]; the original idea is due to Nagao [4]); and
* a lexical transfer system, fortified with morphological analysis and synthesis modules and relying on a number of databases: a machine-readable dictionary (the Collins Spanish/English), the lexicons used by the KBMT modules, a large set of user-generated bilingual glossaries, as well as a gazetteer and a list of proper and organization names.</Paragraph>
<Paragraph position="1"> The results (target-language words and phrases) were recorded in a chart whose initial edges corresponded to words in the source-language input. As a result of the operation of each of the MT engines, new edges were added to the chart, each labeled with the translation of a segment of the input string and indexed by this segment's beginning and end positions. The KBMT and EBMT engines also carried a quality score for each output element. Figure 1 presents a general view of the operation of our multi-engine MT system.</Paragraph>
<Paragraph position="2"> In what follows we illustrate the behavior of the system using the example Spanish sentence: Al momento de su venta a Iberia, VIASA contaba con ocho aviones, que tenían en promedio 13 años de vuelo, which can be translated into English as At the moment of its sale to Iberia, VIASA had eight airplanes, which had on average thirteen years of flight (time). This is a sentence from one of the 1993 ARPA MT evaluation texts.</Paragraph>
<Paragraph position="3"> The initial collection of candidate partial translations placed in the chart for this sentence by each individual engine is shown in Figures 2, 3, 4, 5, and 6. The chart manager selects the overall best "cover" from this collection of candidate partial translations by providing each edge with a normalized positive quality score (larger being better), and then selecting the best combination of edges with the help of the chart-walk algorithm.</Paragraph>
<Paragraph position="4"> The scores in the chart are normalized to reflect the empirically derived expectation of the relative quality of output produced by a particular engine. In the case of KBMT and EBMT, the pre-existing scores are modified, while edges from the other engines receive scores determined by a constant for each engine.</Paragraph>
<Paragraph position="5"> These modifications can include any calculation that can be made with information available from the edge. For example, currently the KBMT scores are reduced by a constant, except for known erroneous output, which has its score set to zero. The EBMT scores initially range from 0 (perfect) to 10,000 (totally bad), but quality is nonlinear in this score; so the region selected by two cutoff constants is converted by a simple linear equation into scores ranging from zero to a normalized maximum EBMT score. Lexical transfer results are scored based on the reliability of the individual glossaries.</Paragraph>
<Paragraph position="6"> In every case, the base score produced by the scoring functions is multiplied by the length of the candidate in words, on the assumption that longer items are better.</Paragraph>
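As an illustration, the following is a minimal Python sketch of a scoring function of this general shape. The Edge record, the constants, and the glossary reliabilities are hypothetical stand-ins, not the values used in the experiment.

```python
# A minimal sketch of the edge-scoring scheme described above. The Edge
# record, the constants, and the glossary reliabilities are hypothetical
# illustrations, not the system's actual values.
from dataclasses import dataclass

@dataclass
class Edge:
    start: int               # position of the first source word covered
    end: int                 # position of the last source word covered (inclusive)
    translation: str
    engine: str              # "kbmt", "ebmt", or the name of a glossary
    raw: float = 0.0         # engine-internal score, where one exists
    erroneous: bool = False  # known-bad KBMT output

KBMT_PENALTY = 0.1                      # hypothetical constant reduction
EBMT_CUTOFF = 4000.0                    # hypothetical upper cutoff
EBMT_MAX = 0.8                          # hypothetical normalized maximum
GLOSSARY_SCORE = {"collins": 0.5, "user-glossary": 0.4}  # hypothetical

def base_score(e: Edge) -> float:
    if e.engine == "kbmt":
        return 0.0 if e.erroneous else e.raw - KBMT_PENALTY
    if e.engine == "ebmt":
        # Raw EBMT scores run from 0 (perfect) to 10,000 (totally bad);
        # clip to the cutoff region, then map linearly onto [0, EBMT_MAX].
        raw = min(max(e.raw, 0.0), EBMT_CUTOFF)
        return EBMT_MAX * (EBMT_CUTOFF - raw) / EBMT_CUTOFF
    return GLOSSARY_SCORE.get(e.engine, 0.3)   # per-glossary constant

def edge_score(e: Edge) -> float:
    # The base score is multiplied by the candidate's length in words.
    return base_score(e) * (e.end - e.start + 1)
```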
<Paragraph position="7"> This may be producing too large an effect on the chart-walk; we intend to test functions other than multiplication in order to find the right level of influence for length. The scoring functions represent all of the chart manager's knowledge about the relative quality of edges.</Paragraph>
<Paragraph position="8"> Once the edges are scored, the cover is produced by a simple dynamic programming algorithm that selects the single best, non-overlapping, contiguous combination of the available component translations, that is, the cover with the best cumulative score, assuming correct component quality scores. The code is organized as a recursive divide-and-conquer procedure: for each position within a segment, the segment is split into two parts, the best possible cover for each part is found recursively, and the two scores are combined to give a score for the chart-walk containing the two best subwalks. This primitive step is repeated for each possible top-level split of the input sentence; the results are compared with each other and with any simple edges (from the chart) spanning the segment, and the overall best result is used.</Paragraph>
<Paragraph position="9"> Without dynamic programming, this procedure would have combinatorial time complexity. Dynamic programming utilizes a large array to store partial results, so that the best cover of any given subsequence is computed only once; the second time that a recursive call would compute the same result, it is instead retrieved from the array. This reduces the time complexity to polynomial, and in practice the procedure uses an insignificant part of total processing time.</Paragraph>
<Paragraph position="10"> (Candidate word and phrase translations for the example sentence appear in Figures 2-6; for instance, "momento" yields moment, instant, time; "venta" yields sale, selling, marketing, country inn, small shop, stall, booth; "vuelo" yields flight, "to dash off", "flight feathers".)</Paragraph>
<Paragraph position="11"> The combined score for a sequence of edges is the weighted average of their individual scores. Weighting by length is necessary so that the same edges, when combined in a different order, produce the same combined score. In other words, whether edges a, b, and c are combined as ((a b) c) or (a (b c)), the combined edge must have the same score, or the algorithm can produce inconsistent results. The chart-walk algorithm can also be visualized as the task of filling a two-dimensional array; the array for our example sentence is shown in Figure 8. Element (i,j) of the array is the best score for any set of edges covering the input from word i to word j. (The associated list of edges is not shown, for readability.)</Paragraph>
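The following is a minimal Python sketch of the chart-walk just described. The chart is assumed to map (i, j) spans to lists of already-scored edges, as produced by a scoring function like the one sketched earlier; the Walk record and all names are hypothetical.

```python
# A minimal sketch of the chart-walk. chart maps (i, j) spans to lists of
# (translation, score) pairs; memo plays the role of the large array of
# partial results described in the text.
from dataclasses import dataclass

@dataclass
class Walk:
    score: float   # length-weighted average of per-word edge scores
    length: int    # number of source words covered
    edges: tuple   # component (span, translation) pairs, left to right

def combine(a: Walk, b: Walk) -> Walk:
    # Weighted average by length; this makes combination associative, so
    # ((a b) c) and (a (b c)) receive the same score, as the text requires.
    n = a.length + b.length
    return Walk((a.score * a.length + b.score * b.length) / n,
                n, a.edges + b.edges)

def chart_walk(chart, i, j, memo):
    """Best cover of source words i..j (inclusive)."""
    if (i, j) in memo:                   # computed once, reused thereafter
        return memo[(i, j)]
    n = j - i + 1
    # Candidates: any single chart edge spanning exactly i..j ...
    candidates = [Walk(score / n, n, (((i, j), translation),))
                  for translation, score in chart.get((i, j), [])]
    # ... and, for each split point p, the combined best subwalks.
    for p in range(i, j):
        left = chart_walk(chart, i, p, memo)
        right = chart_walk(chart, p + 1, j, memo)
        if left is not None and right is not None:
            candidates.append(combine(left, right))
    best = max(candidates, key=lambda w: w.score, default=None)
    memo[(i, j)] = best
    return best

# Usage: best = chart_walk(chart, 1, sentence_length, {})
```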
<Paragraph position="12"> For any position, the score is calculated as a weighted average of the scores in the row to its left and in the column below it, together with the previous contents of the array cell for its position. So to calculate element (1,4), we compare the combined scores of the best walks over (1,1) and (2,4), (1,2) and (3,4), and (1,3) and (4,4) with the scores of any chart edges going from 1 to 4, and take the maximum. When the score in the top-right corner is produced, the algorithm is finished, and the associated set of edges is the final chart-walk result.</Paragraph>
<Paragraph position="13"> (Pseudocode for the chart-walk, from the accompanying figure: to find the best walk on a segment, if there is a stored result for this segment, return it; otherwise, get all primitive edges for the segment, and for each position p within the segment, split the segment into two parts at p, find the best walk for each part, and combine the two into an edge.)</Paragraph>
<Paragraph position="14"> It may seem that the scores should increase towards the top-right corner. In our experiment, however, this has not generally been the case. The system did suggest a number of high-scoring short edges, but many low-scoring edges had to be included to span the entire input. Since the score is a weighted average, these low-scoring edges pull it down. A clear example can be seen at position (18,18), which has a score of 15. The scores above and to its right each average this 15 with a 5, for total values of 10.0, and the score continues to decrease with distance from this point as one moves towards the final score, which does include (18,18) in the cover.</Paragraph>
<Section position="1" start_page="148" end_page="148" type="sub_section">
<SectionTitle> 2.3. Reordering components </SectionTitle>
<Paragraph position="0"> The chart-oriented integration of MT engines does not easily support deviations from the linear order of the source-text elements, as when discontinuous constituents translate contiguous strings or in the case of cross-segmental substring order differences. Following a venerable tradition in MT, we used a set of target-language-dependent postprocessing rules to alleviate this problem (e.g., by switching the order of adjectives and nouns in a noun phrase if it was produced by the word-for-word engine); a sketch of one such rule is given below.</Paragraph>
</Section>
</Section>
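The following is a minimal sketch of one such target-language postprocessing rule, the adjective/noun swap mentioned above. The part-of-speech lookup is assumed to exist; here it is a stub dictionary for illustration only.

```python
# A minimal sketch of one target-language postprocessing rule. The POS
# lookup is a hypothetical stub; a real system would consult a tagger
# or the engine's own dictionary information.
POS = {"airplanes": "NOUN", "white": "ADJ"}   # hypothetical tagger stub

def reorder_noun_adjective(words):
    """Swap NOUN ADJ pairs (Spanish order, as produced by the
    word-for-word engine) into ADJ NOUN (English order)."""
    out = list(words)
    i = 0
    while i < len(out) - 1:
        if POS.get(out[i]) == "NOUN" and POS.get(out[i + 1]) == "ADJ":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2                      # skip past the reordered pair
        else:
            i += 1
    return out

# e.g., word-for-word output for "aviones blancos":
# reorder_noun_adjective(["airplanes", "white"]) -> ["white", "airplanes"]
```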
<Section position="4" start_page="148" end_page="149" type="metho">
<SectionTitle> 3. TRANSLATION DELIVERY SYSTEM </SectionTitle>
<Paragraph position="0"> Results of multi-engine MT were fed in our experiment into a translator's workstation (TWS) [5], through which a translator either approved the system's output or modified it. The main option for human interaction in TWS currently is the Component Machine-Aided Translation (CMAT) editor [6]. A view of this editor is presented in Figure 9. (The user can see the original source-language text in another editor window.) The user can use menus, function keys and mouse clicks to change the system's initially chosen candidate translation string, as well as perform both regular and enhanced editing actions.</Paragraph>
<Paragraph position="1"> The phrases marked by double angle brackets are "components", each of which is the first translation from a candidate chosen by the chart-walk. In the typical editing action shown, the user has clicked on a component to get the main CMAT menu. This menu shows the corresponding source text and provides several functions (such as moving or deleting the whole constituent) and alternate translations, followed by the original source text as an option. If the user selects an alternate translation, it instantly replaces the component in the editor window and becomes the first alternative in this menu if it is used again. The alternate translations are the other translations from the chosen edge.[1]
[1] The CMAT editor may also include translations from other candidates, lower in the menu, if they have the same boundaries as the chosen candidate and the menu is not too long.</Paragraph>
<Paragraph position="2"> Figure 10 shows the sets of candidates in the best chart-walk that are presented as choices to the human user through the CMAT editor in our example, together with their individual engine-level quality scores.</Paragraph>
</Section>
<Section position="5" start_page="149" end_page="149" type="metho">
<SectionTitle> 4. TESTING AND EVALUATING MULTI-ENGINE PERFORMANCE </SectionTitle>
<Paragraph position="0"> As a development tool, it is useful to have an automatic testing procedure that assesses the utility of the multi-engine system relative to the engines taken separately. The best method we could come up with was counting the number of keystrokes, in an advanced text processor such as the TWS, necessary to convert the outputs of the individual engines and of the multi-engine configuration to a "canonical" human translation. A sample test on a passage of 2,060 characters from the June 1993 evaluation of Pangloss is shown in Figure 11.</Paragraph>
<Paragraph position="1"> The difference in keystrokes was calculated as follows: one keystroke for deleting a character; two keystrokes for inserting a character; three keystrokes for deleting a word (in an editor with mouse action); and three keystrokes plus the number of characters in the word being inserted for inserting a word. It is clear from the above table that the multi-engine configuration works better than any of our available individual engines, though it still does not reach the quality of a Level 2 translator. It is also clear that using keystrokes as a measure is not completely satisfactory under the given conditions.</Paragraph>
<Paragraph position="2"> It would be much better to make the comparison not against a single "canonical" translation but against a set of equivalent paraphrastic translations: as all translators know, there are many "correct" ways of translating a given input, so a more appropriate test would count the number of keystrokes of difference between the system output and the closest member of a set of correct translation paraphrases. This, however, is predicated on the availability of a "paraphraser" system, the development of which is not a trivial task.</Paragraph>
<Paragraph position="3"> (Figure 11: for each type of translation, the number of keystrokes needed to convert it to the canonical translation.)</Paragraph>
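To make the metric concrete, here is a minimal sketch that operationalizes these keystroke costs as a word-level edit distance. Treating a substitution as a deletion plus an insertion, and always choosing the cheaper of character-level and word-level operations, are our simplifications; the experiment does not specify these details.

```python
# A minimal sketch of the keystroke metric, applied as a word-level edit
# distance with the costs described in the text. All design choices beyond
# those costs (substitution as delete-plus-insert, taking the cheaper of
# character-level and word-level operations) are hypothetical.
def del_cost(w):   # per-character deletes (1 each) vs. one word delete (3)
    return min(len(w), 3)

def ins_cost(w):   # per-character inserts (2 each) vs. word insert (3 + length)
    return min(2 * len(w), 3 + len(w))

def keystrokes(output_words, canonical_words):
    """Keystrokes to convert a system output to the canonical translation."""
    m, n = len(output_words), len(canonical_words)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + del_cost(output_words[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins_cost(canonical_words[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = output_words[i - 1] == canonical_words[j - 1]
            d[i][j] = min(
                d[i - 1][j] + del_cost(output_words[i - 1]),
                d[i][j - 1] + ins_cost(canonical_words[j - 1]),
                d[i - 1][j - 1] + (0 if same else
                                   del_cost(output_words[i - 1]) +
                                   ins_cost(canonical_words[j - 1])),
            )
    return d[m][n]

# Usage: keystrokes(system_output.split(), canonical_translation.split())
```
</Section>
</Paper>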