THE JaRAP EXPERIMENTAL SYSTEM 
OF JAPANESE-RUSSIAN AUTOMATIC TRANSLATION 
Larisa S. Modina, Zoya M. Shalyapina 
Institute of Oriental Studies, Russian Academy of Sciences, 
Rozhdestvenka str., 12, 103753 Moscow, Russia 
Abstract 
The paper is the first report on the experimental MT 
system developed as part of the Japanese-Russian 
Automatic tra.aslation Project (JaRAP). The sys- 
tem follows the transfer approach to MT. Limited so 
far to lexico-morphologieal processing, it is seen as 
a foundation for more ambitious linguistic research. 
The system is implemented on IBM PC, MS DOS, 
in Arity Prolog (analysis and transfer) and Turbo 
Pascal (synthesis). 
1 Theoretical background 
The development of the Jall.AP experimental sys- 
tem was preceded by a long period of purely 
theoretic research into various aspects of natu- 
ral language ~'td its functioning in translatio~t 
(see, e.g., (Shalyaplna~lO80a,1980b,1988)). Some of 
the basic principles which have evolved from this re- 
search may be summarized ~m follows. 
(1) The most adequate scheme for simulating hu- 
man translation ~ctlvity is doubtless the transfer one. 
(2) The level of transfer and the volume of struc- 
tural and semantic information explicitly represented 
at this level should be determined experimentally as 
a compromise between the demands for translation 
adequacy under the given conditions and the advan- 
tages of "shortcuts" permitted by the superficial cor- 
respondences between the languages concerned. 
(3) Semantics is not in itself a level of linguistic 
representation, but rather part of linguistic descrip- 
tion at any level of representation of linguistic units. 
(4) In its semantic aspects, syntax is dependent on 
lexicon to a greater extent than vice versa. 
(5) A model aimed at faithful simulation of lin- 
guistic performance should make explicit use of the 
factor of linguistic normativity, this being, at least in 
prospect~ a building block for "self-tuning" functions 
as an analogue for human learning capabilities. 
An approach best suited for effectuating these 
principles seeirm to be that of relying on a lexicon- 
oriented lingware framework of a special kind. 
Within this framework, eTttrles of a uIfiform struc- 
ture may be provided, besides lexlcal units, also for 
morphological categories, fanctlon elements (inchd- 
lug punctuation), and all kinds of grammatical fea- 
tures, while syntagmatics of all levels may be pre- 
sented in terms of valencies of those levels, assigned 
to the corresponding lexical or grasmn.atlca\] units in 
their entries. 
The JaFtAP experimental system is meaatt to in- 
corporate this approach. 
In ~cordance with the transfer scheme of transla- 
tion, the system is made up of three major com- 
ponents: the Japanese analysis component, the 
Japanesc-Rl~ssian transfer component, and the Rus- 
sian synthesis (generation) component. It is imple- 
mented on IBM PC, MS DOS, its programming tools 
being Arity Prolog for analysis and traamfer~ and 
Turbo Pascal for synthesis. 
2 The current version of the 
JaRAP system 
At present, the JaRAP system does not go far beyond 
the i~fitial lexico-morphological level of text process- 
ing (though some provision has already been made 
for further stages of its development - see See.3). 
The analysis component of the system performs 
so far three main groups of operations: segmen- 
tation of the input Japanese texts into graphico- 
morphological (CAM-) elements (stems aJtd suffixes 
of Japanese words); processing s\] tranMationally id- 
iomatic (TI-) csm6inatisns of GM-elements; and 
lezieo-rnsrpholofical (LM-) analy~i.¢ of the resulting 
sequence of (\]M-elements aatd the_Jr TLeomblnations. 
Segmentation is accomplished in two steps. 
First, the input text (= the input sequence of k~na 
and kanjl kodes) is broken up into fragments by con- 
teztual delimiters eertMn to denote word or morph 
boundaries (e.g., punctuation marks, the occurrence 
of a k~tat~na symbol after a hlragana one or vice 
verB% etc,). Then the fragmgnts obtained are seg- 
mented into GM-elements by mea~ts of dictionary 
1/2 
search. The resulting GM-elements are represented 
by the reference numbers loc~tlng their dictionary 
entries in the database used. For segmentatimtally 
ambiguous fragments, all possible segmentations are 
formed. If dictionary search is unsuccessful, the pro- 
gram draws oa ant auxiliary index of separate graphic 
symbols, so that "unknown" words can still be pro- 
cessed (nard if they axe composed of "kaatjl, be even 
provided later on with a traatslatlon of sorts). 
The processing of TI combinations of GM- 
dements is partly necessitated by the fact that frag- 
ment boundaa'ies may sometimes sepszate the com- 
ponents of a. eornpottnd word~ llke 
so that these compoitents have then to be joined to- 
gether by a special procedure. The saane procedure is 
used to locate multl-word comblaatloits similar to tin- 
gle GM-elements in that they have idioms.tic trans- 
latlons ~nd do not allow of variations in their htter- 
nal structure (this is often he case with terminologi- 
cal expressions). TI-combint~tlons axe searched for as 
sequences of reference numbers identifying the GM- 
elements they ~e composed of. When. found, they 
are replaced each by a single reference number - thu.t 
of the entry for the TI-combinatlon as a whole, ~.nd 
are subsequently treated in the sumac way as individ- 
ual GM-elements (with some reservations mentioned 
in See.3). 
LM-analysls of a sequence of GM-elements ex- 
amines, for eazh of them, all of its alterna.tive lexico- 
morphological interpretations, or LM-elements con- 
tanned in its entry, with the aim of integnging the 
LM-elements corresponding to adjacent GM-elements 
into accepta.ble morphologtlcsl (M-) representations 
of Japnmese word-forms. The acceptzdfillty of these 
is established by checldng each M-representa-tion, as 
somt as it is formed, for the co-occurrmtce restrk:- 
tions its elements may impose on eax:h other t~nd 
oft the elements of its immediate eoittextual neigh- 
bours. Tlfis also serves for disaznbiguation, as all the 
LM-elements that cannot be used to form a.n accept- 
able M-represent~rtion of a word-form in the given 
sequence of GM-elements, are filtered out. 
To optimize processing where aiterna.tive pa.ths of 
analysis are concerited, all analysis procedure, a.re 
orga.nlzed so as to lhnit sep~r~tte processing of such 
alternatives only to the subpatht; responsible for the 
ditferences between them. If some subp,~th is the 
same in two or more of the alternative axtalyses, it i~ 
processed just once, and the result is used for all the 
corresponding Mteraatives. 
The bulk of the morphologleal deserlptlon nsed 
in LM-analysis is of a valeney-~ased type (an excep- 
tioa being the morphonologica\] - or, r~ther~ morpho- 
graphical - alternations: t.he 10 metarnles represent- 
ing such altentations are incorpor~Lted in the segmen- 
tation procedure). The morphological valencies are 
mostly assigned to suffixes, while stems (verbal or axL 
jecfival) ~t as fillers. The co-occurrence restrictions 
imposed by the eh'.ments of ~ word-forra on those of 
its adjax:ent word-forms are described in much. the 
same way (the only difference bei\]tg that irt this ca~e 
the data. to be checked is assigned to stems at least 
~L8 often as to sufi'ixes). This helps to mt~ke word- 
bonnda.ries tra-nspa.rent, if necessa-ry, to morpholog- 
ical valencies, so that the borderline between mor- 
phology a.nd syntax loses something of its tradltlona\] 
r~ghtity. 
Transfer operatioss s.t I, he lexico-morphologlcal 
level ~tre limited ~Lt present to those of repbLcing tit<.* 
elements of the Jn.panese M-representation obtained 
from analysis, by their Russian equivMents, aatd shift- 
ing, where necessaxy, the ttussi~n morphological cat- 
egories that msy appear a~ a result of such replaxee- 
meat, from the positions they initially occur in to 
tkeir appropriate word-forms. Sometimes this in- 
volves skipping a. uumber of intermediate elements, 
such as a.uxilia.rles, brackets, etc. 
Besides lexieo-morphological transfer, we ha.re 
by now implemented some very simple synts.cfieal 
a.n~ysis-and-trv.nsfer opers.tions based on the most 
general correspondences betwemt Japanese and Run- 
sian structura.l and word-order information. This is 
oitly the very first step to the synta~ctlea\] tr~itsfer 
component we are planning, but the operttt\]ons im- 
plemented ~re already suttleient to provide a,tcquate 
l:tusslan translations for Japanese sentences contain- 
lag no embedded clauses, lexie,'d ~mbiguities, or other 
difficult linguistic phenomenu. 
Thus, the smtteltce: 
::~: ( o> A. l,: al~/,'~,\]{ '( gg 3. 
Nichi-ro kikai hon'yaku ahi~uternu wa 
ooku no hits ni hitsuyos de aroo 
is translated s.q: 
(~L4C'I'eM~I, a.lll0HClgo-pyce'14.ol'o M}IIII14II}IOI'O 
llepelt(),/\[~l, .II~'IJ\[$\[(!'PL'yl) IIO-BI4\]I~I4MOMy} 
Heo6xoJ~,HMO~ MHOI'I4M J!.IO,/\](JtM. 
The information database used in the analysis 
and transfer procedures is organi~,ed as an indexed 
llst of dictlon~ry entries for individual GM-elements, 
TI-comblnations of GM-elernents, and grammatical 
features ((:lasses) of LM-elements. To speed up (lle- 
113 
tionary search~ the database is provided with ~n in- 
dex organized as a superposition of b~lanced trees. 
Ewch entry (presented in the database by ~ Pro- 
log term) constitutes a list of entry zones confined 
e~ch to one type of linguistic information. A sep- 
axate zone (identified by the corresponding label) 
is used to specify, e.g., the graphical representation 
of the GM-element described, its structural (lexico- 
morphological) representation; the llst of its gram- 
matical maxkers; each type of restrictions imposed 
on the elements filling its morphological w, deneles, 
etc. The overall set of entry zones is the same for all 
types of entries, though e~ch entry cont~ns only the 
zones relevant to the element described. 
At present, the d~tabase includes over two thou- 
sand entries. 
Special emphasis has been placed upon providing 
the system with e~cient means of updating linguis- 
tic information. The environment built for this pur- 
pose is called VOCOPS ("VOCabulary updating OP- 
tionS"). 
The VOCOPS environment allows the user to 
add, delete or replace all types of dlction~ry entries 
or zones within them in a highly interactive mode. 
VOCOPS checks the updating information for its for- 
real accuracy and for its compatibility with the in- 
formation already contained in the current databa.se. 
It then proceeds to waxn the user of those con- 
sequence~ of lds updating operations which other- 
wise might have been overlooked, and to indicate 
the in~cur~ies or inconsistencies detected. If pos- 
sible, it also suggests the likely ways of their cor- 
rection. Among other things, VOCOPS keeps watch 
on the correspondence between the entries for indi- 
vidual GM- (and LM-) elements ~d those for their 
TI-combinatlons. E.g., if the user wishes to delete 
a GM-element which forms pazt of some of the TI- 
combinations present in the database, VOCOPS lists 
these with a warning that they will also be deleted. 
The Russian synthesis component is co~l- 
strutted as an independent subsystem, cornplete with 
a database of its own. Its functions include both mor- 
phological generation and some tmpecta of syntactic 
processing. Here we will not discuss it an any length, 
because there is ~ sepaxate paper devoted entirely to 
this component (Kanovich,ShaJyapintr,199.1). 
3 Development work under 
way 
Implementing the most basic (however simple) of the 
linguistic functions needed in translatior h the current 
version of the Jal'tAP system constitutes the neces- 
sary foundation for further developments. Both its 
database and its programming software are struc- 
tured to a~cept any new components (new zones of 
) 
4 
the dictionary entries, new progra.ms, tetc) without 
impairing ~.hose ~Iready functioning..'\]:he VOCOPS 
updating subsystem is also general enough to be e~- 
ily tuned up to new types of linguistic d~t~ as soon 
as they are included in the system. 
Moreover, even in its present form, the JM~.AP sys- 
tem comprises some specific features mea~tt for more 
adwnced lingulstic processing. 
Thus, aznong the grammatical markers assigned to 
the Japanese LM-elements in the current database 
axe u number of those to be used in syntactical anal- 
ysis. 
Entries for TI-comblnations of GM-elements in- 
clude specification of their syntactically ~d seman- 
tically dominant components, for use in processing 
paxallel constructions and a~taphor~. 
The list of the Russian equivalents for an LM- 
element includes, wherever desirable, different parts 
of speech, the choice between them to be effected by 
the syntactical tra~tsfer. 
The synthesis component is designed to ~cept 
syntactic',dly weighted representations of Russian 
word-forms, etc, 
Now that we have built the ba.qic groundwork, 
labor-consuming as it is, we are taking up these, more 
ambitious tasks. 
As the Japanese-Russlan pa~r of languages is vir- 
tually unexplored in its machine-tra~tslation perspec- 
tive, our immediate off'errs are being focussed on de- 
termining the reasonable minimum of grazmnatical 
knowledge of Japanese necessary for obtaining intel- 
ligible Russian output for unadapted (us-pro-edited) 
Japanese input. 
References 
\[1\] Kanovich, M.I., Sha\]ys.pina, Z.M. (1994) The RU- 
MORS system of guaslam synthesis (submitted 
for COLING 94). 
\[2\] S\]~s.lya, plnu, Z.M. (1980), hutomutie (;ranshJtiou 
as a, model of the human trmnsla.tion activity. In- 
ternational Forum on Information and Documen- 
tation, v.5~ No.2~ p.18-23. 
\[3\] Shalyaplua, Z.M. (1980). Problems of formal rep- 
resentatlon of text structure from the point of 
view of automatic tr~nslation. I~t COLING 80. 
Proceedings of the 8th Intcrnatlonal Cort\]erence on 
computational Linguistic. Tokyo, p.174-182. 
\[4\] Sha\]yapina, Z.M. (1988). Text ~ ~n object of ~u- 
tomatic translation. In T~kst i persvod, Moscow, 
p.113-129 (in Russia.n) 
114 
