<?xml version="1.0" standalone="yes"?> <Paper uid="C94-1014"> <Title>A MATCHING TECHNIQUE IN EXAMPLE-BASED MACHINE TRANSLATION Lambros CRANIAS, Harris PAPAGEORGIOU, Stelios PIPERIDIS Institute for Language and Speech Processing, GREECE</Title> <Section position="3" start_page="0" end_page="1220" type="intro"> <SectionTitle> 1. INTRODUCTION </SectionTitle> <Paragraph position="0"> EBMT is based on the idea of performing translation by imitating translation examples of similar sentences [Nagao 84]. In this type of translation system, a large number of bi/multi-lingual translation examples are stored in a textual database, and input expressions are rendered in the target language by retrieving from the database the example that is most similar to the input.</Paragraph> <Paragraph position="1"> There are three key issues which pertain to example-based translation: * establishment of correspondence between units in a bi/multi-lingual text at sentence, phrase or word level * a mechanism for retrieving from the database the unit that best matches the input * exploiting the retrieved translation example to produce the actual translation of the input sentence. [Brown 91] and [Gale 91] have proposed methods for establishing correspondence between sentences in bilingual corpora. [Brown 93], [Sadler 90] and [Kaji 92] have tackled the problem of establishing correspondences between words and phrases in bilingual texts.</Paragraph> <Paragraph position="2"> The third key issue of EBMT, that is, exploiting the retrieved translation example, is usually dealt with by integrating conventional MT techniques into the system [Kaji 92], [Sumita 91]. 
Simple modifications of the translation proposal, such as word substitution, would also be possible, provided that alignment of the translation archive at word level was available.</Paragraph> <Paragraph position="3"> In establishing a mechanism for best-match retrieval, which is the topic of this paper, the crucial tasks are: (i) determining whether the search is for matches at sentence or sub-sentence level, that is, determining the &quot;text unit&quot;, and (ii) defining the metric of similarity between two text units.</Paragraph> <Paragraph position="4"> As far as (i) is concerned, the obvious choice is to use the sentence as the text unit. This is because not only are sentence boundaries unambiguous, but translation proposals at sentence level are also what a translator is usually looking for. Sentences can, however, be quite long. The longer they are, the less likely it is that they will have an exact match in the translation archive, and the less flexible the EBMT system will be.</Paragraph> <Paragraph position="5"> On the other hand, if the text unit is the sub-sentence, we face one major problem: the possibility that the resulting translation of the whole sentence will be of low quality, due to boundary friction and incorrect chunking. In practice, EBMT systems that operate at sub-sentence level involve the dynamic derivation of the optimum length of segments of the input sentence by analysing the available parallel corpora. This requires a procedure for determining the best &quot;cover&quot; of an input text by segments of sentences contained in the database [Nirenburg 93]. It is assumed that the translation of the segments of the database that cover the input sentence is known. What is needed, therefore, is a procedure for aligning parallel texts at sub-sentence level [Kaji 92], [Sadler 90]. 
If sub-sentence alignment is available, the approach is fully automated but is quite vulnerable to the problem of low quality mentioned above, as well as to ambiguity problems when the produced segments are rather small.</Paragraph> <Paragraph position="6"> Despite the fact that almost all running EBMT systems employ the sentence as the text unit, it is believed that the potential of EBMT lies in the exploitation of fragments of text smaller than sentences and the combination of such fragments to produce the translation of whole sentences [Sato 90]. Automatic sub-sentential alignment is, however, a problem yet to be solved.</Paragraph> <Paragraph position="7"> Turning to the definition of the metric of similarity, the requirement is usually twofold. The similarity metric applied to two sentences (by sentence from now on we will refer to both sentence and sub-sentence fragment) should indicate how similar the compared sentences are, and perhaps the parts of the two sentences that contributed to the similarity score. The latter could be just a useful indication to the translator using the EBMT system, or a crucial functional factor of the system, as will be explained later.</Paragraph> <Paragraph position="8"> The similarity metrics reported in the literature can be characterised according to the text patterns they are applied to. Thus, the word-based metrics compare individual words of the two sentences in terms of their morphological paradigms, synonyms, hyperonyms, hyponyms, antonyms, pos tags... [Nirenburg 93] or use a semantic distance d (0 ≤ d ≤ 1) which is determined by the Most Specific Common Abstraction (MSCA) obtained from a thesaurus abstraction hierarchy [Sumita 91]. 
Then, a similarity metric is devised, which reflects the similarity of two sentences by combining the individual contributions towards similarity stemming from word comparisons.</Paragraph> <Paragraph position="9"> The word-based metrics are the most popular, but other approaches include syntax-rule driven metrics [Sumita 88], character-based metrics [Sato 92] as well as some hybrids [Furuse 92]. The character-based metric has been applied to Japanese, taking advantage of certain characteristics of the Japanese language. The syntax-rule driven metrics try to capture the similarity of two sentences at the syntax level. This seems very promising, since similarity at the syntax level, perhaps coupled with lexical similarity in a hybrid configuration, would be the best the EBMT system could offer as a translation proposal.</Paragraph> <Paragraph position="10"> The real time feasibility of such a system is, however, questionable since it involves the complex task of syntactic analysis.</Paragraph> <Paragraph position="11"> In section II a similarity metric is proposed and analysed. The statistical system presented consists of two phases, the Learning phase and the decision making or Recognition phase, which are described in section III.</Paragraph> <Paragraph position="12"> Finally, in section IV the experiment configuration is discussed and the results evaluated.</Paragraph> <Paragraph position="13"> II. THE SIMILARITY METRIC To encode a sentence into a vector, we exploit information about the functional words/phrases (fws) appearing in it, as well as about the lemmas and pos (part-of-speech) tags of the words appearing between fws/phrases. 
Based on the combination of fws/phrases data and pos tags, a simple view of the surface syntactic structure of each sentence is obtained.</Paragraph> <Paragraph position="14"> To identify the fws/phrases in a given corpus the following criteria are applied: * fws introduce a syntactically standard behaviour * most of the fws belong to closed classes</Paragraph> <Paragraph position="15"> * the semantic behaviour of fws is determined through their context * most of the fws determine phrase boundaries * fws have a relatively high frequency in the corpus. According to these criteria, prepositions, conjunctions, determiners, pronouns, certain adverbials etc. are regarded as fws. Having identified the fws of the corpus, we distinguish groups of fws on the basis of their interchangeability in certain phrase structures. The grouping also caters for the multiplicity of usages of a certain word which has been identified as a fw, since a fw can be a part of many different groups. In this way, fws can serve the retrieval procedure with respect to the following two levels of contribution towards the similarity score of two sentences: * identity of fws of retrieved example and input (I) * fws of retrieved example and input not identical but belonging to the same group (G). To obtain the lemmas and pos tags of the remaining words in a sentence, we use a part-of-speech Tagger with no disambiguation module, since disambiguation would be time consuming and not 100% accurate. Instead, we introduce the concept of ambiguity class (ac) and we represent each non-fw by its ac and the corresponding lemma(s) (for example, the unambiguous word &quot;eat&quot; would be represented by the ac which is the set {verb} and the lemma &quot;eat&quot;) (in English, for an ambiguous word, the corresponding lemmas will usually be identical. But this is rarely true for Greek). 
Hence, the following two levels of contribution to the similarity score stem from non-fws: * overlapping of the sets of possible lemmas of the two words (L) * overlapping of the ambiguity classes of the two words (T). Hence, each sentence of the source part of the translation archive is represented by a pattern, which is expressed as an ordered series of the above mentioned feature components.</Paragraph> <Paragraph position="16"> A similarity metric is defined between two such vectors, and is used in both the Learning and Recognition phases. Comparing a test vector against a reference vector is, however, not straightforward, since there are generally axis fluctuations between the vectors (the vectors are not necessarily aligned and are most probably of different lengths). To overcome these problems we use a two-level Dynamic Programming (DP) technique [Sakoe 78], [Ney 84]. The first level treats the matches at fw level, while the second is reached only in case of a match in the first level, and is concerned with the lemmas and tags of the words within fw boundaries.</Paragraph> <Paragraph position="17"> Both levels utilise the same (DP) model, which is described next.</Paragraph> <Paragraph position="18"> We have already referred to the (I) and (G) contributions to the similarity score due to fws. But this is not enough. We should also take into account whether the fws appear in the same order in the two sentences, whether an extra (or a few) fws intervene in one of the two sentences, whether certain fws are missing ... To deal with these problems, we introduce a third contribution to the similarity score, which is negative and is called penalty score (P). So, as we are moving along a diagonal of the xy-plane (corresponding to matched fws), whenever a fw is mismatched, it produces a negative contribution to the score along a horizontal or vertical direction. 
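The first-level scoring just described can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function name fw_similarity, the groups mapping, the numeric values of I, G and P, and the rule that a path may restart at score zero (so that the result reflects the best pair of coherent parts) are all hypothetical choices made here; the paper sets the parameter values dynamically and calls a second-level DP on diagonal steps, which is omitted.

```python
def fw_similarity(xs, ys, groups, I=1.0, G=0.5, P=-0.4):
    """Score the best-matching coherent parts of two fw sequences.

    xs, ys : lists of functional words (fws), in sentence order
    groups : dict mapping a fw to the set of group names it belongs to
    I, G   : rewards for identical fws and for fws sharing a group (assumed values)
    P      : negative penalty for a horizontal or vertical step (assumed value)
    """
    rows, cols = len(xs) + 1, len(ys) + 1
    score = [[0.0] * cols for _ in range(rows)]
    best = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            x, y = xs[i - 1], ys[j - 1]
            # Horizontal or vertical step: one fw is skipped, so add the penalty.
            # The 0.0 lets a path restart instead of carrying a negative total.
            cell = max(0.0, score[i - 1][j] + P, score[i][j - 1] + P)
            # Diagonal step: allowed only for identical or same-group fws.
            if x == y:
                cell = max(cell, score[i - 1][j - 1] + I)
            elif groups.get(x, set()).intersection(groups.get(y, set())):
                cell = max(cell, score[i - 1][j - 1] + G)
            score[i][j] = cell
            best = max(best, cell)
    return best


# Hypothetical grouping: "in" and "on" are interchangeable in some phrase structure.
example_groups = {"in": {"loc"}, "on": {"loc"}}
print(fw_similarity(["in", "the", "of"], ["on", "of"], example_groups))
```

Because the penalty is subtracted from the score accumulated so far, a long run of matches can absorb more skipped fws than a short one before the path stops being worthwhile.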
In figure 1 the allowable transitions in the xy-plane are shown.</Paragraph> <Paragraph position="19"> Whenever a diagonal transition is investigated, the system calls the second-level DP-algorithm, which produces a local additional score due to the potential similarity of lemmas and tags of the words lying between the corresponding fws. This score is calculated using exactly the same DP-algorithm as the one treating fws (allowing additions, deletions,...), provided that we use (L), (T) and (PT) (a penalty score attributed to a mismatch at the tag-level) in place of (I), (G) and (P) respectively.</Paragraph> <Paragraph position="20"> The outcome of the DP-algorithm is the similarity score between two vectors, which allows for different lengths of the two sentences, similarity of different parts of the two sentences (the last part of one with the first part of the other) and, finally, a variable number of additions and deletions. The score produced corresponds to two coherent parts of the two sentences under comparison.</Paragraph> <Paragraph position="21"> Emphasis should be given to the variable number of additions and deletions. The innovation of the penalty score (which is in fact a negative score) provides the system with the flexibility to afford a different number of additions or deletions depending on the accumulated similarity score up to the point where these start.</Paragraph> <Paragraph position="22"> Moreover, the algorithm determines, through a backtracking procedure, the relevant parts of the two vectors that contributed to this score. This is essential for the sentence segmentation described in the next section.</Paragraph> <Paragraph position="23"> It should also be noted that the similarity score produced is based mainly on the surface syntax of the two sentences (as indicated by the fws and pos tags) and only secondarily on the actual words of the two sentences. 
This is quite reasonable, since the two sentences could have almost the same words in the source language but no similarity at all in the source or target language (due to different word order, as well as different word utilisation), while if they are similar in terms of fws as well as in terms of the pos tags of the words between fws, then the two sentences would almost certainly be similar (irrespective of a few differences in the actual words) in the target language as well (which is the objective).</Paragraph> <Paragraph position="24"> The DP-algorithm proposed seems to be tailored to the needs of the similarity metric, but there is still a crucial set of parameters to be set, that is A={I,G,P,L,T,PT}. The DP-algorithm is just the framework for the utilisation of these parameters. The values of the parameters of A are set dynamically depending on the lengths of the sentences under comparison. I, G, L, T are set to values (I, G are normalised by the lengths of the sentences in fws, while L, T are normalised by the lengths of the blocks of words appearing between fws) which produce a 100% similarity score when the sentences are identical, while P, PT reflect the user's choice of penalising an addition or deletion of a word (functional or not).</Paragraph> <Paragraph position="25"> III. LEARNING AND RECOGNITION PHASES In the Learning phase, the modified k-means clustering procedure [Wilpon 85] is applied to the source part of the translation archive, aiming to produce clusters of sentences, each represented by its centre only. The algorithm produces the optimum segmentation of the corpus into clusters (based on the similarity metric), and determines each cluster centre (which is just a sentence of the corpus) by using the minmax criterion. 
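One common reading of a minmax criterion, when similarities rather than distances are compared, is to pick as centre the member whose worst similarity to any other member of the cluster is as high as possible. The Python sketch below illustrates that reading only; the function name minmax_centre and the toy similarity used in the example are hypothetical, and the paper does not spell out the exact formulation it uses.

```python
def minmax_centre(members, similarity):
    """Return the member whose lowest similarity to the others is highest.

    members    : non-empty list of cluster members (for example, sentence vectors)
    similarity : function taking two members and returning a number
                 (higher means more similar)
    """
    def worst_similarity(i):
        # Lowest similarity between member i and every other member.
        others = [similarity(members[i], m)
                  for j, m in enumerate(members) if j != i]
        # A one-member cluster has no others; treat it as a perfect centre.
        return min(others) if others else float("inf")

    best_index = max(range(len(members)), key=worst_similarity)
    return members[best_index]


# Toy example: numbers stand in for sentences, closeness for similarity.
toy_similarity = lambda a, b: -abs(a - b)
print(minmax_centre([1, 2, 10], toy_similarity))
```

In the setting of the paper, the similarity argument would be the DP-based metric of section II, and each member a sentence vector from the source part of the archive.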
The number of clusters can be determined automatically by the process, subject to some cluster quality constraint (for example, minimum intra-cluster similarity), or alternatively can be determined externally based upon memory-space restrictions and speed requirements.</Paragraph> <Paragraph position="26"> Once the clustering procedure is terminated, a search is made, among the sentences allocated to a cluster, to locate second best (but good enough) matches to the sentences allocated to the remaining clusters. If such matches are traced, the relevant sentences are segmented and then the updated corpus is reclustered. After a number of iterations, convergence is obtained (no new sentence segments are created) and the whole clustering procedure is terminated.</Paragraph> <Paragraph position="27"> Although the objective of a matching mechanism should be to identify in a database the longest piece of text that best matches the input, the rationale behind sentence segmentation is in this case self-evident. It is highly probable that a sentence is allocated to a cluster centre because of a good match due to a part of it, while the remaining part has nothing to do with the cluster to which it will be allocated. Hence, this part will remain hidden to an input sentence applied to the system at the recognition phase. On the other hand, it is also highly probable that a given input sentence does not, as a whole, match a corpus sentence, but rather different parts of it match with segments belonging to different sentences in the corpus. Providing whole sentences as translation proposals, of which only a part matched part of the input sentence, would perhaps puzzle the translator instead of helping him (her).</Paragraph> <Paragraph position="28"> But sentence segmentation is not a straightforward matter. We cannot just segment a sentence at the limits of the part that led to the allocation of the sentence to a specific cluster. 
This is because we need to know the translation of this part as well. Hence, we should expand the limits of the match to cover a &quot;translatable unit&quot; and then segment the sentence. Automatic sub-sentential alignment (which would produce the &quot;translatable units&quot;), however, is not yet mature enough to produce high fidelity results. Hence, one resorts to the use of semi-automatic methods (in our application with the CELEX database, because of the particular format in which the texts appear, a rough segmentation of the sentences is straightforward and can therefore be automated).</Paragraph> <Paragraph position="29"> If alignment at sub-sentential level is not available, the segmentation of the sentences of the corpus is not possible (it is absolutely pointless). Then, the degree of success of the Learning phase will depend on the length of the sentences contained in the corpus. The longer these sentences tend to be, the less successful the Learning phase. On the other hand, if alignment at sub-sentential level is available, we could just apply the clustering procedure to these segments. But then, we might end up with an unnecessarily large number of clusters and &quot;sentences&quot;. This is because, in a specific corpus, quite a lot of these segments tend to appear together. Hence, by clustering whole sentences and then segmenting only in case of a good match with a part of a sentence allocated to a different cluster, we can avoid the overgeneration of clusters and segments. When the iterative clustering procedure is finally terminated, the sentences of the original corpus will have been segmented into &quot;translatable units&quot; in an optimum way, so that they are efficiently represented by a set of sentences which are the cluster centres.</Paragraph> <Paragraph position="30"> In the Recognition phase, the vector of the input sentence is extracted and compared against the cluster centres. 
Once the favourite cluster(s) is specified, the search space is limited to the sentences allocated to that cluster only, and the same similarity metric is applied to produce the best match available in the corpus. If the sentences in the translation archive have been segmented, the problem is that, now, we do not know what the &quot;translatable units&quot; of the input sentence are (since we do not know its target language equivalent).</Paragraph> <Paragraph position="31"> We only have potential &quot;translatable unit&quot; markers. This is not really a restriction, however, since by setting a high enough threshold for the match with a segment (translatable piece of text) in the corpus, we can be sure that the part of the input sentence that contributed to this good match will also be translatable and we can, therefore, segment this part. This process continues until the whole input sentence has been &quot;covered&quot; by segments of the corpus.</Paragraph> <Paragraph position="32"> IV. THE APPLICATION - EVALUATION The development of the matching method presented in this paper was part of the research work conducted under the LRE I project TRANSLEARN. The project will initially consider four languages: English, French, Greek and Portuguese. The application on which we are developing and testing the method is implemented on the Greek-English language pair of records of the CELEX database, the computerised documentation system on Community Law, which is available in all Community languages. The matching mechanism is, so far, implemented on the Greek part, providing English translation proposals for Greek input sentences. 
The sentences contained in the CELEX database tend to be quite long, but due to the particular format in which they appear (corresponding to articles, regulations,...), we were able to provide the Learning phase with some potential segmentation points of these sentences in both languages of the pair (these segmentation points are in one-to-one correspondence across languages, yielding the &quot;sub-sentence&quot; alignment).</Paragraph> <Paragraph position="33"> In tagging the Greek part of the CELEX database we came across 31 different ambiguity classes, which are utilised in the matching mechanism. The identification and grouping of the Greek fws was mainly done with the help of statistical tools applied to the CELEX database.</Paragraph> <Paragraph position="34"> We tested the system on 8,000 sentences of the CELEX database. We present results for two versions: one with 80 clusters (1% of the number of sentences in the corpus used), which resulted in 10,203 &quot;sentences&quot; (sentences or segments) in 2 iterations, and one with 160 clusters, which resulted in 10,758 &quot;sentences&quot; in 2 iterations. To evaluate the system, we asked five translators to assign each translation proposal of the system (in our application these proposals sometimes refer to segments of the input sentence) to one of four categories: A: The proposal is the correct (or almost correct) translation. B: The proposal is very helpful in order to produce the translation. C: The proposal can help in order to produce the translation.</Paragraph> <Paragraph position="35"> D: The proposal is of no use to the translator.</Paragraph> <Paragraph position="36"> We used as test suite 200 sentences of the CELEX database which were not included in the translation archive. The system proposed translations for 232 &quot;sentences&quot; (segments or whole input sentences) in the former case and for 244 in the latter case. 
The results are tabulated in table 1 (these results refer to the single best match located in the translation archive). The table shows that in the case of 160 clusters, (1) in 62% of the cases the system will be very useful to the translator, and (2) some information can at least be obtained from 82% of the retrievals. In the case of 80 clusters the results do not change significantly. Hence, as far as the similarity metric is concerned, the results seem quite promising (it should, however, be mentioned that the CELEX database is quite suitable for EBMT applications, due to its great degree of repetitiveness). On the other hand, the use of clustering of the corpus dramatically decreases the response time of the system, compared to the alternative of searching exhaustively through the corpus. Other methods for limiting the search space do exist (for example, using full-text retrieval based on content words), but are rather lossy, while clustering provides an effective means of locating the best available match in the corpus (in terms of the similarity metric employed). This can be seen in Table 2, where the column &quot;MISSED&quot; indicates the percentage of the input &quot;sentences&quot; for which the best match in the corpus was not located in the favourite cluster, while the column &quot;MISSED BY&quot; indicates the average deviation of the located best matches from the actual best matches in the corpus for these cases.</Paragraph> <Paragraph position="37"> In Table 1 as well as in Table 2 it can be seen that a quite substantial decrease in the number of clusters affected the results only slightly. This small deterioration in the performance of the system is due to &quot;hidden&quot; parts of sentences allocated to clusters (parts that are not represented by the cluster centres). Hence, the smaller the &quot;sentences&quot; contained in the database and the more numerous the clusters, the better the performance of the proposed system. 
The number of clusters, however, should be constrained for the search space to be effectively limited.</Paragraph> </Section> </Paper>