Czech-%o-Russlan Transducing Dictlo~ary 
Alla B6movh, ¥1adlslav Kubon 
NA~ - lln~ui~tles 
Charles Unlversi%y 
Malostransk~ m~m. 25 
CS-118 O0 Prague I 
0. A bottleneck of all produetional-oriented 
machine translation systems is to handle 
those words which are not included into 
main dictionaries of the particular system. 
In this paper we want to describe one 
possible approach to this problem, based on 
the idea of the so-called transduetlonal 
dictionary. The first part of the paper is 
devoted to the results of our empirical 
inquiry into the problem in closely related 
languages, the second one to the description 
of the implementatio" of the transducing 
dictionary in the Czech-to-Russian machine 
translation system RUSLAN (Hajji, 1987). 
1.1. The basic idea of the transducing dic- 
tionary, originally developed by Z.Kirschner 
for the Engllsh-to-Czech NT system APAC 
(Kirschner, !987, l{ajiSov6 and ~irsehner, 
1987), is quite simple. Among the words 
which failed to be found in the basic dic- 
tionaries of the system, are some special 
sets of words, which are very similar in 
both languages. The goal of the transducing 
dictionary is to translate those words pro- 
perly according to the rules, dealing with 
regularities between source and target shape 
of them. This way of handling lexicon as an 
open system differs substantically from the 
traditional one. It is very useful especia\]- 
ly in the translation of the scientific or 
technical texts where many international 
words are used. 
The transducing dictionary (TD), origi- 
nally used for ~T between English and Czech 
languages, handled mostly international 
(Greek-Latin) words. We have assumed that 
there will be a greater number of lexico- 
derivational parallels between two closely 
related languages and that they will contri- 
bute to a successful analysis to a greater 
extent. This assumption turned to be false. 
1.2. Let us now turn to some issues of a 
contrastive analysis of Czech and Russian to 
investigate, where the idea of TD could be 
applied for the translation between these 
two languages. 
For instance, the Czech ending -ace is 
translated as -acija, as in modlfikace - 
modlfikaeija "modification" etc.; thus, e.g. 
if the Cz. word "emulace" is not found in 
the main dictionary, it is translated by the 
transducing rule as "emulaeiJa". A word 
translated in this algorithmic way may be 
subdued to the procedure of orthographic 
changes and modifications, in which some 
rules of Cz.-R. alternation of graphemes are 
applied, as, e.g., ~--~e, h--~g, y~i , x-~ks, 
au-~av, ou-~u, the initial e-~e, etc. 
It goes without saying that such a 
transduction is accompanied by a risk of 
creating errors or is impossible, e.g., in 
case that the output language lacks such a 
direct derivational variant. However, even 
in such cases it is possible to apply along 
with the transducing dictionary further 
"emergency rules" and to determine, on the 
basis of the final segment of the wor~, at 
least the type of the word and its grammati- 
cal characteristics, and thus to prevent the 
termination of the analysis of the whole 
sentence due to a single non--identified 
word. Of course, every irregular lexical 
correspondence can be included in the main 
dictionary. 
1.3. The rules of TD, as indicated above, 
can be used with several degrees of comp\]e-- 
xity: from the possibility to translate 
"mechanically" without any changes or only 
with some orthographic modifications as in 
elektroskop - elektroskop - "electroscope", 
through the necessity of certain modifica- 
tions of the word-formative segment or en- 
ding according to the needs of the target 
language as in algoritmus - algoritm 
"algorithm" to the limitation to a mere 
identification of the word as for its word 
class, or, as the case may be, its morpholo- 
gical or semantic characteristics as, e.g., 
with tla~Itko - knopka "push-button", where 
the ending segment -tko points to the noun, 
neuter, ~enoting an instrument (means). 
314 
According to these degrees, the set of 
derivational morphemes productive for the 
domain of computer techniques can be classi- 
fied into several groups; not always we are, 
of course, concerned with a suffix in a 
strictly linguistic specification of the 
term; rather, we deal with a segment which 
has for the given pair of languages the 
required properties (it may consist of two 
suffixes or their parts, as, e.g., in -kce, 
-uee, and also of an original word stem, as 
gram; we quote here segments in the form of 
nominative sing., i.e. together with the 
inflectional ending). 
1.3.1. The most substantial part of the 
first group consists in words denoting the 
names of machines, devices and instruments, 
and also some names of agents of events, of 
some properties and processes. These words 
can be translated to Russian by a mere 
transcription. The following suffixes are 
members of this group: 
-skop, -metr, -tot, -graf, -tron, -fen, -at, 
-log, ~ant, -ent, -tufa, ~ika. 
I.\[~.2. The next group contains nouns and 
adjectives the form of which must he modi- 
fied according to models customary in the 
target language, which involves some minor 
changes in the derlvational segment or en- 
ding. The respective suffixes are: 
-smus/-zm, -ace/-aciJa , -kce/--kclja , -use/ 
-uelJa, -ance/ -anoiJa, -ence/-eneiJa, -ie/ 
-iJ~, -gram/-gzamma, -aze/-azis,-~eze/-ezls, 
-Ita/-ost', -ium/-lJ, -ista/-ist. 
Certain regularities of translation of 
derivatlonal suffixes carl be observed also 
wlth adjectives, which form an important 
part of terminology, of. the following 
correspondence: 
~Ick@/-i~eskiJ , -~rn~/-arnyj , -~ln~/-a~nyJ , 
-antnl/-antnyj, -entn~/-entnyj, -~nl/ 
-clonnyJ, -ivn~/-~vnyJ. 
1.3.~. The third group includes semantically 
uniform and productive classes of words, as, 
e.g., deverbative nouns ending with -ani~ 
-end, which denote in Czech a process or its 
result (kodov~n~ - encoding, spou~t~n~ - 
switching on ), nouns in -ost denoting a 
property (pra~nost - dustinnes, vlhkost - 
humidity), nouns in -tel denoting an animate 
or inanimate actor (odh~ratel - consumer, 
ukazatel - marker) and also nouns in -~t~ 
with the meaning of a certain ~lace or space 
(nalezi~t~ - deposit, praeovi~%~ - working 
place) and nouns in -tvl with the meaning of 
a feature, workshop or property (mno~stvl - 
a number of, pekagstv~ - bakery, bohatstv~ -- 
richness). All these word-formative types 
have a corresponding equivalence in Russian, 
with minor modifications. However, in the 
course of the long-term development, the 
semantic shifts of the word basis prevent 
the possibility of the translation of these 
types only by means of the word-formation 
correspondence of the transducing dictlona-- 
ry. TD cannot do here more than specify the 
word class or the gender. These data can be 
used in the syntactic analysis of the source 
language but in the D. output the word has 
to be marked as not boeing found in the 
dictionary. 
I.~.4. There is one more group of words we 
encounter in technical texts which is pro- 
ductlve in Czech and has a clearly specified 
meaning. There belong the nouns in -6 (- 
i~, -a~, -e~): chladi~ - cooler; these nouns 
denote first of all a device or instrument, 
sometimes also an animate actor. The suffix 
has no word formation parallel in R. and is 
substituted by other suffixes, or frequently 
by a complex naming unit: p~eklada~ - 
translJator "translator" or ptep~na~ - pere- 
klju~ate~ "switch". A similar situation oc- 
curs with the suffixes -~tko, -~tko, -dlo, 
-~rna, -ovna. Also there suffixes have no 
corresponding equivalent in P. 
In spite of the fact that ~n this 
group, the derivatlonal means determine 
classes charaeteristlc both as for their 
meaning and form, this fact is made use of 
onl; partially - the transducing device 
easily and correctly identifies their word 
class and can asslgn to them morphological , 
or as the case may be, semantic characteri- 
stic. 
3.1. The implementation of the trans~uclnK 
dictionary was to a great extent influenced 
by the programming language used, namely the 
systems Q (Colmeraurer, without date) in 
which the whole system RUSLM~ is written. 
Systems Q provide means for working with 
grammar directly in the form of rewriting 
rules, which are realized by means of a 
labelled graph; the use of a rule means 
adding a new edge. In order to limit the 
number of rules, the programme (i.e. the set 
of rewriting rules) is divided into phases. 
At the end of each phase , the redundant 
edges are removed. The whole system HUSLAN 
consists of 16 phases. 
The TD does not immediately follow the 
main dictionary (i.e. the dictionary of 
2 315 
stems) because the mechanism of morphologi- 
cal analysis cannot distinguish at that 
point, which words have been found in the 
dictionary and which have not. This is due 
to the fact that the identification is given 
by non-empty intersection of the pattern of 
the stem and the patterns that come into 
consideration as the patterns of the ending; 
this intersection is identified by the phase 
immediately following the dictionary look- 
up. In the next phase, a preliminary syntac- 
tic analysis is carried out, in the course 
of which the unambiguous words found in the 
dictionary and immediately depending on each 
other are put together. At this point, the 
rules of the TD are applied. 
The forms which have not been found in 
the main dictionary pass through the phases 
unchanged. In the first phase of the TD two 
kinds of rules are applied - the first one 
dealing with the groups from 1.3.1. and 
1.3.2. and the second one with the rest 
(I.3.3. and 1.3.4.). The only difference 
among those two types of rewriting rules is 
that the second one does not transcribe the 
Cz. lexical value of the word into the R. 
Both add morphological characteristics and 
se,~antical features to particular endings. 
The first phase of %he TD contains also 
the first portion of rules rewriting the 
result of the application of the above men- 
tioned rules (of the first type ). The 
following transcriptions are carried out : 
hi->i, x-~ks, hey (at the beginning of the 
word) -~ev, ou~u, hy-~gi , all long vowels 
are changed to short, 9~r, ~e, d~d6, 
t'~t6, n-~n6 ( 6 is a code for the R. soft 
marker) , gy-Pgi , ky~ki. In this way, the 
original Czech stems (to which the ~. en~ 
dings are attached at this point) are trans- 
cribed into the corresponding R. forms. 
The second phase of TD is added in 
RUSLAN to the phase of Czech analysis in 
which the disintegrated stems are put 
together again. This phase of TD has two 
main tasks: 
(a) to finish the transcription of the Czech 
stems Into the Russian ones, which for tech- 
nical reasons could not be done in the pre- 
ceding phase, 
(b) to modify the form of representation in 
such a way as to be consistent with the 
words that have been found in the 
dictionary; the rules remove the auxiliary 
symbols and put together disintegrated ~us- 
sian lexieal unit. 
The third phase of TD processes a rat- 
her small set of words which have not been 
found in the main dictionary and for which 
none of the rules of the preceding phases of 
TD can be applied. This unrecognized words 
are great obstacle for the build-up of the 
dependency tree because they do not allow 
the parts that stand to their left and to 
their right to be put together in the course 
of the syntactico-semantic analysis. As a 
temporary solution of this problem, tho 
rules of this phase rewrite all hitherto 
unidentified words as nouns in neuter sing. 
For a more reliable solution, the work is in 
progress to distinguish between the endings 
in a more subtle way (e.g. those forms en- 
ding in -~ should be assigned the word class 
of adverbs, etc.). The default procedure of 
the third phase of TD arrives at a dead end 
in case the unidentified word is a verb, 
since then the sentence lacks its key ele- 
ment - the root of the tree. 
It is evident that none of the g~liding 
lines applied in the TS in its present form 
is 100% valid. However, it has a great value 
for the experlmental testing of the system 
because it filters out words that have to be 
added to the dictionaries of the system. 

References

Colmerauer A.: "Les systemes Q ou un forma- 
lisme Dour analyser et synthetiser des 
phrases sur ordinateur". Mimeo, without 
date. Montreal. 

HaJi~ J.: "Ruslan - an MT System I~etween 
Closely ~elated Languages, In~ Pro- 
ceedings of the Third Conference of the 
European Chapter of the Association for 
Computatienal Linguistics, Copenhagen 
1987, p.I04 - 108. 

HaJi~ov~ E. - Kirschner Z.: "Fail-Soft 
("Emergency") Measures in a Production- 
Oriented MT System", In: Proceedings 
of the Third Conference of the European 
Chapter of the Association for Computa- 
tional Linguistics, Copenhagen 1987. 

Eirschner Z.: "APAC3-2: An Engllsh-to-Czech 
Machine Translation ~ystem". Expllzlte 
Besehrelbung der Spraehe und automati- 
sche Textbearbeitung XIII, Prague 1987. 
