Machine Translation of Very Close Languages
 Jan HAJI(~ Computer Science Dept. Johns Hopkins University 3400 N. Charles St., 
Baltimore, MD 21218, USA hajic@cs.jhu.edu Jan HRIC KTI MFF UK Malostransk6 
nfim.25 Praha 1, Czech Republic, 11800 hric@barbora.m ff.cuni.cz Vladislav KUBON 
OFAL MFF UK Malostransk6 mim.25 Praha 1, Czech Republic, 11800 
vk@ufal.mff.cuni.cz
 
Abstract
 Using examples of the transfer-based MT system between Czech and Russian RUSLAN 
and the word-for-word MT system with morphological disambiguation between Czech 
and Slovak (~ESILKO we argue that for really close s it is possible to 
obtain better translation quality by means of simpler methods. The problem of 
translation to a group of typologically similar s using a pivot  
is also discussed here.
 
demonstrate that this assumption holds only for really very closely related 
s.
 
1. Czech-to-Russian MT system RUSLAN 1.1 History
 
Introduction
 Although the field of machine translation has a very long history, the number of 
really successful systems is not very impressive. Most of the funds invested 
into the development of various MT systems have been wasted and have not 
stimulated a development of techniques which would allow to translate at least 
technical texts from a certain limited domain. There were, of course, 
exceptions, which demonstrated that under certain conditions it is possible to 
develop a system which will save money and efforts invested into human 
translation. The main reason why the field of MT has not met the expectations of 
sci-fi literature, but also the expectations of scientific community, is the 
complexity of the task itself. A successful automatic translation system 
requires an application of techniques from several areas of computational 
linguistics (morphology, syntax, semantics, discourse analysis etc.) as a 
necessary, but not a sufficient condition. The general opinion is that it is 
easier to create an MT system for a pair of related s. In our 
contribution we would like to
 
The first attempt to verify the hypothesis that related s are easier to 
translate started in mid 80s at Charles University in Prague. The project was 
called RUSLAN and aimed at the translation of documentation in the domain of 
operating systems for mainframe computers. It was developed in cooperation with 
the Research Institute of Mathematical Machines in Prague. At that time in 
former COMECON countries it was obligatory to translate any kind of 
documentation to such systems into Russian. The work on the Czech-to-Russian MT 
system RUSLAN (cf. Oliva (1989)) started in 1985. It was terminated in 1990 
(with COMECON gone) for the lack of funding.
 
System description
 
The system was rule-based, implemented in Colmerauer's Q-systems. It contained a 
fullfledged morphological and syntactic analysis of Czech, a transfer and a 
syntactic and morphological generation of Russian. There was almost no transfer 
at the beginning of the project due to the assumption that both s are 
similar to the extent that does not require any transfer phase at all. This 
assumption turned to be wrong and several phenomena were covered by the transfer 
in the later stage of the project (for example the translation of the Czech verb 
"b~" [to be] into one of the three possible Russian equivalents: empty form, the 
form "byt6" in future
 
fitense and the verb "javljat6sja"; or the translation of verbal negation). At 
the time when the work was terminated in 1990, the system had a main translation 
dictionary of about 8000 words, accompanied by so called transducing dictionary 
covering another 2000 words. The transducing dictionary was based on the 
original idea described in Kirschner (1987). It aimed at the exploitation o f 
the fact that technical terms are based (in a majority o f European s) 
on Greek or Latin stems, adopted according to the particular derivational rules 
o f the given s. This fact allows for the "translation" o f technical 
terms by means of a direct transcription of productive endings and a slight 
(regular) adjustment o f the spelling of the stem. For example, the English 
words localization and discrimination can be transcribed into Czech as 
"lokalizace" and "diskriminace" with a productive ending -ation being 
transcribed to -ace. It was generally assumed that for the pair Czech/Russian 
the transducing dictionary would be able to profit from a substantially greater 
number o f productive rules. This hypothesis proved to be wrong, too (see 
B6mov~, Kubofi (1990)). The set o f productive endings for both pairs 
(English/Czech, as developed for an earlier MT system from English to Czech, and 
Czech/Russian) was very similar. The evaluation o f results o f RUSLAN showed 
that roughly 40% o f input sentences were translated correctly, about 40% with 
minor errors correctable by a human post-editor and about 20% of the input 
required substantial editing or re-translation. There were two main factors that 
caused a deterioration of the translation. The first factor was the 
incompleteness o f the main dictionary of the system. Even though the system 
contained a set of so-called fail-soft rules, whose task was to handle such 
situations, an unknown word typically caused a failure o f the module o f 
syntactic analysis, because the dictionary entries contained - besides the 
translation equivalents and morphological information - very important syntactic 
information. The second factor was the module of syntactic analysis o f Czech. 
There were several reasons of parsing failures. Apart from the common inability 
of most rule-based formal grammars to cover a
 
particular natural  to the finest detail o f its syntax there were other 
problems. One o f them was the existence of non-projective constructions, which 
are quite common in Czech even in relatively short sentences. Even though they 
account only for 1.7�/'o of syntactic dependencies, every third Czech sentence 
contains at least one, and in a news corpus, we discovered as much as 15 
non-projective dependencies; see also Haji6 et al. (1998). An example o f a 
non-projective construction is "Soubor se nepodafilo otev~it." [lit.: File Refl. 
was_not._possible to_open. - It was not possible to open the file]. The 
formalism used for the implementation (Q-systems) was not meant to handle 
non-projective constructions. Another source of trouble was the use o f 
so-called semantic features. These features were based on lexical semantics o f 
individual words. Their main task was to support a semantically plausible 
analysis and to block the implausible ones. It turned out that the question o f 
implausible combinations o f semantic features is also more complex than it was 
supposed to be. The practical outcome o f the use o f semantic features was a 
higher ratio of parsing failures - semantic features often blocked a plausible 
analysis. For example, human lexicographers assigned the verb 'to run' a 
semantic feature stating that only a noun with semantic features o f a human or 
other living being may be assigned the role o f subject of this verb. The input 
text was however full o f sentences with 'programs' or 'systems' running etc. It 
was o f course very easy to correct the semantic feature in the dictionary, but 
the problem was that there were far too many corrections required. On the other 
hand, the fact that both s allow a high degree o f word-order freedom 
accounted for a certain simplification o f the translation process. The grammar 
relied on the fact that there are only minor word-order differences between 
Czech and Russian. 1.3 Lessons learned from RUSLAN
 
We have learned several lessons regarding the MT o f closely related s: 
� The transfer-based approach provides a similar quality o f translation both 
for closely related and typologically different s � Two main bottlenecks 
o f full-fledged transfer-based systems are:
 
fi- complexity o f the syntactic dictionary - relative unreliability o f the 
syntactic analysis of the source  Even a relatively simple component 
(transducing dictionary) was equally complex for English-to-Czech and 
Czech-to-Russian translation Limited text domains do not exist in real life, it 
is necessary to work with a high coverage dictionary at least for the source 
.
 
2. Translation and localization 2.1 A pivot 
 
Localization o f products and their documentation is a great problem for any 
company, which wants to strengthen its position on foreign  market, 
especially for companies producing various kinds o f software. The amounts o f 
texts being localized are huge and the localization costs are huge as well. It 
is quite clear that the localization from one source  to several target 
s, which are typologically similar, but different from the source 
, is a waste of money and effort. It is o f course much easier to 
translate texts from Czech to Polish or from Russian to Bulgarian than from 
English or German to any o f these s. There are several reasons, why 
localization and translation is not being performed through some pivot , 
representing a certain group o f closely related s. Apart from political 
reasons the translation through a pivot  has several drawbacks. The most 
important one is the problem o f the loss o f translation quality. Each 
translation may to a certain extent shift the meaning o f the translated text 
and thus each subsequent translation provides results more and more different 
from the original. The second most important reason is the lack of translators 
from the pivot to the target , while this is usually no problem for the 
translation from the source directly to the target .
 
MAHT (Machine-aided human translation) systems. We have chosen the TRADOS 
Translator's Workbench as a representative system o f a class o f these 
products, which can be characterized as an example-based translation tools. 
IBM's Translation Manager and other products also belong to this class. Such 
systems uses so-called translation memory, which contains pairs o f previously 
translated sentences from a source to a target . When a human translator 
starts translating a new sentence, the system tries to match the source with 
sentences already stored in the translation memory. If it is successful, it 
suggests the translation and the human translator decides whether to use it, to 
modify it or to reject it. The segmentation o f a translation memory is a key 
feature for our system. The translation memory may be exported into a text file 
and thus allows easy manipulation with its content. Let us suppose that we have 
at our disposal two translation memories - one human made for the source/pivot 
 pair and the other created by an MT system for the pivot/target 
 pair. The substitution o f segments o f a pivot  by the 
segments of a target  is then only a routine procedure. The human 
translator translating from the source  to the target  then gets 
a translation memory for the required pair (source/target). The system o f 
penalties applied in TRADOS Translator's Workbench (or a similar system) 
guarantees that if there is already a human-made translation present, then it 
gets higher priority than the translation obtained as a result o f the automatic 
MT. This system solves both problems mentioned above the human translators from 
the pivot to the target  are not needed at all and the machinemade 
translation memory serves only as a resource supporting the direct human 
translation from the source to the target .
 
3. M a c h i n e
 
translation
 
o f (very) closely
 
related Slavic s
 In the group o f Slavic s, there are more closely related s than 
Czech and Russian. Apart from the pair o f Serbian and Croatian s, which 
are almost identical and were
 
Translation memory is the key
 
The main goal of this paper is to suggest how to overcome these obstacles by 
means o f a combination of an MT system with commercial
 
ficonsidered one  just a few years ago, the most closely related 
s in this group are Czech and Slovak. This fact has led us to an 
experiment with automatic translation between Czech and Slovak. It was clear 
that application of a similar method to that one used in the system RUSLAN would 
lead to similar results. Due to the closeness of both s we have decided 
to apply a simpler method. Our new system, (~ESILKO, aims at a maximal 
exploitation of the similarity of both s. The system uses the method of 
direct word-for-word translation, justified by the similarity of syntactic 
constructions of both s. Although the system is currently being tested 
on texts from the domain of documentation to corporate information systems, it 
is not limited to any specific domain. Its primary task is, however, to provide 
support for translation and localization of various technical texts. 3.1 System 
( ~ E S i L K O
 
and its governing noun. An alternative way to the solution of this problem was 
the application of a stochastically based morphological disambiguator 
(morphological tagger) for Czech whose success rate is close to 92�/'0. Our 
system therefore consists of the following modules: 1. Import of the input from 
so-called 'empty' translation memory 2. Morphological analysis of Czech 3. 
Morphological disambiguation 4. Domain-related bilingual glossaries (incl. 
single- and multiword terminology) 5. General bilingual dictionary 6. 
Morphological synthesis of Slovak 7. Export of the output to the original 
translation memory Letus now look in a more detail at the individual modules of 
the system: ad 1. The input text is extracted out of a translation memory 
previously exported into an ASCII file. The exported translation memory (of 
TRADOS) has a SGML-Iike notation with a relatively simple structure (cf. the 
following example): Example 1. - A sample of the exported translation memory 
<RTF Preamble>...</RTF Preamble> <TrU> <CrD>23051999 <CrU>VK <Seg L=CS_01>Pomoci 
v~kazu ad-hoc m65ete rychle a jednoduge vytv~i~et regerge. <Seg L=SK_01 >n/a 
</TrU> Our system uses only the segments marked by <Seg L=CS_01>, which contain 
one source  sentence each, and <Seg L=SK_01>, which is empty and which 
will later contain the same sentence translated into the target 
 
The greatest problem of the word-for-word translation approach (for s 
with very similar syntax and word order, but different morphological system) is 
the problem of morphological ambiguity of individual word forms. The type of 
ambiguity is slightly different in s with a rich inflection (majority of 
Slavic s) and in s which do not have such a wide variety of 
forms derived from a single lemma. For example, in Czech there are only rare 
cases of part-of-speech ambiguities (st~t [to stay/the state], zena 
[woman/chasing] or tri [three/rub(imperative)]), much more frequent is the 
ambiguity of gender, number and case (for example, the form of the adjective 
jam[ [spring] is 27-times ambiguous). The main problem is that even though 
several Slavic s have the same property as Czech, the ambiguity is not 
preserved. It is distributed in a different manner and the "form-for-form" 
translation is not applicable. Without the analysis of at least nominal groups 
it is often very difficult to solve this problem, because for example the actual 
morphemic categories of adjectives are in Czech distinguishable only on the 
basis of gender, number and case agreement between an adjective
 
by CESiLKO.
 ad 2. The morphological analysis of Czech is based on the morphological 
dictionary developed by Jan Haji6 and Hana Skoumalov~i in 1988-99 (for latest 
description, see Haji~ (1998)). The dictionary contains over 700 000 dictionary 
entries and its typical coverage varies between
 
fi99% (novels) to 95% (technical texts). The morphological analysis uses the 
system of positional tags with 15 positions (each morphological .category, such 
as Part-of-speech, Number, Gender, Case, etc. has a fixed, singlesymbol place in 
the tag). Example 2 - tags assigned to the word-form "pomoci" (help/by means of) 
pomoci: NFP2 ...... A .... ]NFS7 ...... A .... I R--2 ........... where : N - 
noun; R - preposition F - feminine gender S - singular, P - plural 7, 2 - case 
(7 - instrumental, 2 - genitive) A - affirmative (non negative) ad 3. The module 
of morphological disambiguation is a key to the success o f the translation. It 
gets an average number of 3.58 tags per token (word form in text) as an input. 
The tagging system is purely statistical, and it uses a log-linear model of 
probability distribution see Haji~, Hladkfi (1998). The learning is based on a 
manually tagged corpus of Czech texts (mostly from the general newspaper 
domain). The system learns contextual rules (features) automatically and also 
automatically determines feature weights. The average accuracy o f tagging is 
between 91 and 93% and remains the same even for technical texts (if we 
disregard the unknown names and foreign- terms that are not ambiguous 
anyway). The lemmatization immediately follows tagging; it chooses the first 
lemma with a possible tag corresponding to the tag selected. Despite this simple 
lemmatization method, and also thanks to the fact that Czech words are rarely 
ambiguous in their Part-of-speech, it works with an accuracy exceeding 98%.
 
The multiple-word terms are sequences of lemmas (not word forms). This structure 
has several advantages, among others it allows to minimize the size of the 
dictionary and also, due to the simplicity of the structure, it allows 
modifications of the glossaries by the linguistically naive user. The necessary 
morphological information is introduced into the domain-related glossary in an 
off-line preprocessing stage, which does not require user intervention. This 
makes a big difference when compared to the RUSLAN Czech-to-Russian MT system, 
when each multiword dictionary entry cost about 30 minutes of linguistic 
expert's time on average. ad 5. The main bilingual dictionary contains data 
necessary for the translation o f both lemmas and tags. The translation of tags 
(from the Czech into the Slovak morphological system) is necessary, because due 
to the morphological differences both systems use close, but slightly different 
tagsets. Currently the system handles the 1:1 translation of tags (and 2:2, 3:3, 
etc.). Different ratio of translation is very rare between Czech and Siovak, but 
nevertheless an advanced system of dictionary items is under construction (for 
the translation 1:2, 2:1 etc.). It is quite interesting that the lexically 
homonymous words often preserve their homonymy even after the translation, so no 
special treatment of homonyms is deemed necessary. ad 6. The morphological 
synthesis of Slovak is based on a monolingual dictionary of SIovak, developed by 
J.Hric (1991-99), covering more than ]00,000 dictionary entries. The coverage of 
the dictionary is not as high as o f the Czech one, but it is still growing. It 
aims at a similar coverage of Slovak as we enjoy for Czech. ad 7. The export o f 
the output of the system (~ESILKO into the translation memory (of TRADOS 
Translator's Workbench) amounts mainly to cleaning of all irrelevant SGML 
markers. The whole resulting Slovak sentence is inserted into the appropriate 
location in the original translation memory file. The following example also 
shows that the marker <CrU> contains an information that the target  
sentence was created by an M T system.
 
ad 4. The domain-related bilingual glossaries contain pairs of individual words 
and pairs of multiple-word terms. The glossaries are organized into a hierarchy 
specified by the user; typically, the glossaries for the most specific domain 
are applied first. There is one general matching rule for all levels of 
glossaries - the longest match wins.
 
s, namely for Czech-to-Polish translation. Although these s are 
not so similar as Czech and Slovak, we hope that an addition of a simple partial 
noun phrase parsing might provide results with the quality comparable to the 
fullfledged syntactic analysis based system RUSLAN (this is of course true also 
for the Czechoto-Slovak translation). The first results of Czech-to Polish 
translation are quite encouraging in this respect, even though we could not 
perform as rigorous testing as we did for Slovak.
 
Acknowledgements 3.2 Evaluation of results
 The problem how to evaluate results of automatic translation is very difficult. 
For the evaluation of our system we have exploited the close connection between 
our system and the TRADOS Translator's Workbench. The method is simple - the 
human translator receives the translation memory created by our system and 
translates the text using this memory. The translator is free to make any 
changes to the text proposed by the translation memory. The target text created 
by a human translator is then compared with the text created by the mechanical 
application of translation memory to the source text. TRADOS then evaluates the 
percentage of matching in the same manner as it normally evaluates the 
percentage of matching of source text with sentences in translation memory. Our 
system achieved about 90% match (as defined by the TRADOS match module) with the 
results of human translation, based on a relatively large (more than 10,000 
words) test sample. This project was supported by the grant GAt~R 405/96/K214 
and partially by the grant GA(~R 201/99/0236 and project of the Ministry of 
Education No. VS96151.

The accuracy of the translation achieved by our system justifies the hypothesis 
that word-forword translation might be a solution for MT of really closely 
related s. The remaining problems to be solved are problems with the 
oneto many or many-to-many translation, where the lack of information in 
glossaries and dictionaries sometimes causes an unnecessary translation error. 
The success of the system CESILKO has encouraged the investigation of the 
possibility to use the same method for other pairs of Slavic
 
References
 
Bemova, Alevtina and Kubon, Vladislav. (1990). Czech-to-Russian Transducing Dictionary. In Proceedings of the 13th COLING conference, Helsinki 1990.

Hajic, Jan. (1998). Building and Using a Syntactially Annotated Coprus: The Prague Dependency Treebank. In: Festschrifi for Jarmila Panevov~i, Karolinum Press, Charles Universitz, Prague. pp. 106---132. 

Hajic, Jan and Barbora Hladka. (1998). Tagging Inflective Languages. Prediction of Morphological Categories for a Rich, Structured Tagset. ACL Coling '98, Montreal, Canada, August 1998, pp. 483-490. 

Hajic, Jan; Brill, Eric; Collins, Michael; Hladka Barbora; Jones, Douglas; Kuo, Cynthia; Ramshaw, Lance; Schwartz, Oren; Tillman, Christoph; and Zeman, Daniel.  Core Natural Language Processing Technology Applicable to Multiple Languages. The Workshop'98 Final Report. CLSP JHU. Also at: http:llwww.clsp.jhu.edulws981projectslnlplreport. 

Kirschner, Zdenek. (1987). APAC3-2: An English-to-Czech Machine Translation System; Explizite Beschreibung der Sprache und automatische Textbearbeitung XII1, MFF UK Prague 

Oliva, Karel. (1989). A Parser for Czech Implemented in Systems Q; Explizite Beschreibung der Sprache und automatische Textbearbeitung XVI, MFF UK Prague
  
