Semantic based generation of Japanese German translation system 
- Result and Evaluation- 
K. Hanakata 
Institut f. Informatik 
University of Stuttgart 
Herdweg 51 
D-7000 Stuttgart 1
F.R. Germany 
A. Lesniewski 
Standard Elektrik Lorenz AG, 
Ostendstrasse 3 
D-7530 Pforzheim 
F.R. Germany 
S. Yokoyama 
Electrotechnical Laboratory
Umezono, Sakuramura, Niihari 
Ibaraki 305 
Japan 
Abstract 
Project SEMSYN has reached a state in which a prototype
system generates German texts from the semantic
representations that ATLAS/II of Fujitsu Laboratory produces
from Japanese texts. This paper describes some problems that
are specific to our semantic based approach and some results
of the evaluation study carried out by the Germanist group.
I. Generation procedure in SEMSYN 
This section summarizes the SEMSYN generation procedure.
Readers who are interested in more detail on the SEMSYN
system are referred to our previous COLING-84 paper [1] or
to the paper submitted to this conference [2]. The
generation process begins with the conversion of the
semantic networks, each of which represents one sentence,
into a so-called IKBS (Instantiated Knowledge Base Schema).
The IKBS is an instantiation of the case or concept schemata
denoted by the semantic symbols that appear as nodes in the
semantic network. A case schema contains three main
description slots: a) roles of the cases associated with the
semantic symbol, b) transformation rules of schemata, and
c) choice of German syntactic realization schemata.
Triggered by the semantic symbols of the given network, the
IKBS specifies the best basic syntactic structure associated
with a German word by checking the fillers of the roles, and
converts them into functional roles within each German
syntactic category. A German syntacto-morphological
component called SUTRA-S [3], an extended version of SUTRA
[4], generates German surface texts from the instantiated
syntactic structure, called IRS (Instantiated Realization
Schemata).
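The procedure above can be pictured with a minimal Python sketch. It is not the SEMSYN implementation: the class `CaseSchema`, the functions `instantiate` and `realize`, the role names `object` and `purpose`, and the labels `von-PP` and `zu-PP` are all hypothetical and serve only to show how a case schema with the three slots could turn a network node into an IKBS-like instance and then into an IRS-like structure.

```python
from dataclasses import dataclass, field

@dataclass
class CaseSchema:
    """Hypothetical stand-in for a case schema with the three slots from the text."""
    symbol: str
    roles: list                                           # a) roles of the cases
    transformations: list = field(default_factory=list)   # b) transformation rules (unused here)
    realizations: dict = field(default_factory=dict)      # c) role -> German functional role

def instantiate(schema, network_node):
    """IKBS-like step: keep only the fillers of roles the schema declares."""
    return {role: network_node[role] for role in schema.roles if role in network_node}

def realize(schema, ikbs, head_word):
    """IRS-like step: map filled roles to functional roles of a German head word."""
    return {"head": head_word,
            "args": {schema.realizations[role]: filler for role, filler in ikbs.items()}}

# Assumed example data, loosely modelled on "Die Verwendung von ... zur ...".
use_act = CaseSchema("USE.ACT", roles=["object", "purpose"],
                     realizations={"object": "von-PP", "purpose": "zu-PP"})
node = {"object": "kleine Computer", "purpose": "Durchfuehrung", "time": None}
irs = realize(use_act, instantiate(use_act, node), "Verwendung")
print(irs)
```

In this sketch the surface generation step (the role of SUTRA-S) is left out; the resulting dictionary only marks which filler would become which German constituent.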
Though English-like terms are used for semantic symbols,
the choice of the German word associated with each semantic
symbol, and its syntactic structure, differ considerably
from the corresponding English ones.
II. Some problems of the semantic based translation approach
The semantic based approach has advantages as well as
disadvantages, which we anticipated at the beginning of the
project. One reason why we adopted a semantic based approach
rather than the syntactic transfer approach lies in the
cultural difference and communication barriers between the
two project groups that cooperate to build the translation
system. Understanding the content of the original sentence
from the given semantic representation, the generation group
can express it in a way that is common in its mother tongue,
relatively free from syntactic restrictions and from
lexically corresponding terminology. It is well known that
a language of one culture can only be interpreted, not
literally translated, into the language of a different
culture, whereas literal translation may be possible within
the same cultural sphere. As a matter of fact, we often
made use of this advantage in our generation system.
On the other hand, exactly this freedom frequently turned
out to be a disadvantage on the generation side. Dealing
with real data (titles of scientific papers in the field of
information technology from the Japanese data base JOIS), we
encountered problems we had not expected and recognized the
limits of our approach. In the following we describe some of
these problems:
1) Japanese original text
We also had to cope with well known problems such as the
lack of articles (definite or indefinite) and of the
distinction between numbers (singular or plural) for nouns
as well as verbs. We embedded some heuristic rules in the
KBS and the dictionary to add these syntactic features
where they must not be missing in the German text. There
is still a deeper semantics that governs these decisions,
but it cannot be represented in general, except for very
limited cases. The heuristic rules are based on our
ambiguity conservation principle, i.e. we keep the ambiguity
of the input text as far as possible in order to avoid any
active selection of one alternative that might lead to an
expression that is wrong from the viewpoint of the author of
the title. The following examples show typical errors of
number and article generated by the present SEMSYN
heuristics. They also illustrate how difficult it is to find
a trade-off between ambiguity conservation and an active
decision inferred from the content:
E.g. 1: [Japanese original text]
SEMSYN generation: Die Verwendung von kleinen
Computern zur Durchfuehrung von grossen graphischen
Programmen
(The application of small computers for the execution
of large graphic programs)
Comment: The author of the paper discusses how to use a
small computer to execute a very large graphic package, so
readers will naturally assume one small computer rather than
many small computers, though the latter reading is possible.
On the other hand, it is generally assumed that a computer
processes many programs, so for "programs" the plural is
more natural than the singular. In any case, it is bad
German to have neither a number feature nor an article, as
is the case in the original text.
E.g. 2: [Japanese original text]
SEMSYN generation: Die Entwicklung des Kerns
beim Betriebssystem vom verteilten Typ fuer
real-time Anwendungen.
Correct German: Die Entwicklung des Kerns eines 
verteilten Betriebssystems fuer Echtzeitanwendungen 
(The development of the kernel in the operating sys- 
tem of the distributed type for real-time applications) 
Comment: It is assumed that the author developed
the kernel of one distributed OS, rather than of many
distributed OSs, for many applications.
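The trade-off discussed in this subsection can be made concrete with a small sketch. It does not reproduce the SEMSYN heuristics, which are not spelled out here; the function `candidate_realizations` and its feature names are assumptions. It only enumerates which combinations of number and article remain open when the Japanese source leaves a feature unspecified, using the fact that German has no indefinite plural article.

```python
from itertools import product

def candidate_realizations(number=None, definite=None):
    """Enumerate (number, article type) pairs compatible with what is known.

    None means the Japanese source left the feature open (assumed encoding).
    """
    numbers = [number] if number else ["sing", "plur"]
    definiteness = [definite] if definite is not None else [True, False]
    options = []
    for num, is_definite in product(numbers, definiteness):
        if is_definite:
            article = "definite"
        elif num == "sing":
            article = "indefinite"
        else:
            article = "none"   # German has no indefinite plural article
        options.append((num, article))
    return options

# Nothing known: four candidates, and each one commits to a number.
print(candidate_realizations())
# Number decided as plural: a definite or a bare plural remains.
print(candidate_realizations(number="plur"))
```

Every candidate fixes a number, which is why the generator cannot simply carry over the underspecified Japanese noun: some active decision is unavoidable, and the heuristics decide which one is taken.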
2) Ambiguity of conjunctions
One of the hard problems we expected in our semantic 
representation was the ambiguity in the coordinating 
conjunctions in an attributive context such as: 
<AP> A, B and C <PP>. 
E.g.3 
high speed bus, memory and switching in bit slice 
technology 
The scope could be made unique if the semantic network
allowed a node that denotes a subnetwork. Conjunctive
subnetworks can be classified into three basic cases:
(i) [Diagram: <arc1> - <enum> - <arc2>]
E.g.4 . . efficient algorithm and computation for 
parallel processors 
(ii) [Diagram: <arc1> - <enum> - <arc2>]
E.g.5 . . new algorithm and computation by a vector 
processor 
(iii)
E.g.6 . . high-speed bus and memory management by stacks
In practice, however, we found that among the 2000 titles
we have so far generated from the given semantic networks,
about 380 contain conjunctions; 90% of these belong to case
(i), only about 8% to case (ii), and the rest to case (iii).
These statistics may be specific to titles, but they
indicate that authors of titles are aware of the syntactic
structural ambiguity and consequently try to avoid such a
straightforward sequence of conjunctions except in case (i).
Besides these sample-based statistics, the conjunctive
ambiguity is further weakened by the fact that the
generation system, following our ambiguity conservation
principle, produces ambiguous titles and lets expert readers
naturally infer which reading is meant by the author. At the
moment we deal with both cases (ii) and (iii) by exploiting
this possibility, conveying the ambiguity as if it were case
(i).
Though this conjunctive ambiguity in semantic networks
seemed at first glance to be a serious factor, it
fortunately turned out to be a very minor problem, as the
evaluation study indicates.
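To give a feeling for the absolute numbers behind these percentages, the following short calculation converts them into approximate title counts. The figures 2000, 380, 90% and 8% are taken from the text; since the text itself says "about", the resulting counts are rough estimates, and the variable names are ours.

```python
total_titles = 2000          # titles generated so far
with_conjunction = 380       # titles that contain conjunctions (approximate)

# Percentages of the conjunction titles per case; case (iii) is "the rest".
share_pct = {"(i)": 90, "(ii)": 8}
share_pct["(iii)"] = 100 - sum(share_pct.values())

# Approximate number of titles per case, rounded to whole titles.
counts = {case: round(with_conjunction * pct / 100) for case, pct in share_pct.items()}

print(f"titles with conjunctions: {with_conjunction / total_titles:.0%}")
print(counts)
```

Under these assumptions roughly 342 titles fall under case (i), about 30 under case (ii) and about 8 under case (iii), which illustrates how few titles are affected when cases (ii) and (iii) are rendered like case (i).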
3) Stylistic problem
Generally speaking, a semantic based generation approach
has a strong advantage as well as a disadvantage in terms
of sentence style. The stylistic advantage is based on
the large freedom in interpreting a given semantic
representation. A serious disadvantage is exactly the
other side of this freedom of interpretation. The following
examples illustrate typical stylistic problems of our
generation system:
E.g.7 [Japanese original text]
SEMSYN generation: Spezifikation, Simulation und
Entwicklung von Protokollen, fuer die PDIL verwendet
werden, fuer die Kommunikation
(Specification, simulation and development of protocols,
for which PDIL is applied, for the communication)
Comment: "fuer die PDIL verwendet werden, fuer
die Kommunikation" should be expressed as ".. eines
Kommunikationsprotokolls unter Verwendung von PDIL"
(".. of a communication protocol by using PDIL", instead
of ".. for which PDIL ..").
E.g.8 [Japanese original text]
SEMSYN generation: Die Repraesentation von
Informationen, die ableitbar in einem Speicher gewesen
werden.
(The representation of information which can be
derived in a memory)
Comment: The clause ".., die ableitbar .." should
be replaced by an adjective phrase: "von der aus dem
Speicher ableitbaren Information".
E.g.9 [Japanese original text]
SEMSYN generation: Die Verwendung von Datenbanksystemen
zu einem Verfahren zur Aufstellung von Bedingungen
beim Verfahren zur Verstaerkung von Oberflaechen
des Kohlendioxidlasers
(The application of data base systems for the listing
of conditions for the surface hardening procedure
with CO2 laser)
Comment: Instead of repeating the nominalized case
frames for the role purpose, "Verfahren zur Aufstellung"
and "Verfahren zur Verstaerkung", the latter should
be expressed as "Oberflaechenverhaertungsverfahren".
Though badly styled expressions may transmit the correct
meaning, they substantially reduce the understandability
of the generated texts. Stereotypical bad styles can be
easily improved in some cases; however, the style conversion
problem seems to range continuously from "easy to patch"
cases to ones of practically unlimited depth that can only
be pursued in the long run.
4) Cultural difference problems 
Before we started the project we discussed many problems
that are specifically attributed to the well known cultural
difference. The following are some of the real problems we
encountered in dealing with title translations:
i) Focus shift 
We frequently have to cope with differences in focus, which
force us into a conflict between fidelity of the translation
and the common style of German titles.
E.g.10 [Japanese original text]
SEMSYN generation: Die Verwendung von Semantiken
zur Spezifikation von Kommunikationsprozessen
(The application of semantics for specification of 
communication processes) 
Comment: The original Japanese text does not contain
an explicit word that corresponds to the semantic
symbol "USE.ACT", which is inferred by the analyzer.
Generally speaking, however, it sounds better in German
if an expression explicates the meaning in a more
resolved form, while ambiguous or even fuzzy expressions
are preferred in Japanese. In this example the purpose
arc, expressed as "zur", implies the application of the
semantics.
ii) Reversed causality 
The most striking case that exemplifies the opposite
relation between east and west is the reversed expression
of causality: in Japanese, would-be results are mostly
expressed where the west would express the cause, and vice
versa. The following example demonstrates this:
E.g.11 [Japanese original text]
SEMSYN generation: Probleme bei der Ausbildung
ueber Lehrer, die spezielles Computerwissen besitzen,
innerhalb eines Schulsystems.
(Problems of training teachers, who own special
computer knowledge, within the school systems)
Comment: Here the Japanese original text means
that the special computer knowledge is a result of
the training. If the teachers already had this special
knowledge, they would not need the training. Therefore
it must be expressed as "so as to have ..".
At the moment neither our analyzer nor our generator can
afford such a deep understanding of input texts. Our
approach is still open to enriching the TRAIN schema so
that it represents the causal relation of the TRAIN concept,
which forces the causality of the given meaning to be
reversed.
III. Evaluation
About 20% of the translation results produced from the
available semantic networks were evaluated. To avoid
misunderstanding, it should be made clear that this
evaluation was not a so-called blind test: all semantic
networks had already been used as our training samples.
This is because only 2000 semantic networks were available
when the evaluation study started. The evaluation results
are summarized as follows:
Grade   Fidelity   Grammaticality
1       68.0%      16.7%
2       29.7%      48.0%
3        2.3%      19.0%
4        0.0%      16.3%
5        0.0%       0.0%
Explanation:
Grade Fidelity
1: Exactly the same meaning as the original text
2: Almost the same content
3: Still acceptable and informative
4: Only partially acceptable
5: Nothing to do with the content of the original text
Grade Grammaticality
1: Correct style, syntax and morphology
2: Correct syntax and morphology, but stylistic
defect, or vice versa
3: Still readable, but substantial mistakes in
syntax, morphology and style
4: Almost unreadable as German text
5: Not German
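The grade distributions can be condensed into a few summary figures. The sketch below uses only the percentages from the table above; the mean grade and the cumulative share are descriptive summaries that we add for illustration and are not part of the evaluation study itself. The helper names `mean_grade` and `share_up_to` are ours.

```python
# Percentage of evaluated titles per grade, copied from the table.
fidelity = {1: 68.0, 2: 29.7, 3: 2.3, 4: 0.0, 5: 0.0}
grammaticality = {1: 16.7, 2: 48.0, 3: 19.0, 4: 16.3, 5: 0.0}

def mean_grade(dist):
    """Percentage-weighted average grade (1 is best, 5 is worst)."""
    return sum(grade * pct for grade, pct in dist.items()) / sum(dist.values())

def share_up_to(dist, grade):
    """Percentage of titles graded `grade` or better."""
    return sum(pct for g, pct in dist.items() if g <= grade)

for name, dist in (("Fidelity", fidelity), ("Grammaticality", grammaticality)):
    print(f"{name}: mean grade {mean_grade(dist):.2f}, "
          f"grade 1-2: {share_up_to(dist, 2):.1f}%")
```

Read this way, almost all titles reach grade 1 or 2 for fidelity, while for grammaticality about two thirds do, which matches the emphasis on stylistic and syntactic errors in the classification that follows.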
Based on these evaluation results we sorted our error
sources. The following error classification gives readers
an idea of the development state of our system.
Error classification                  Occurrence among 300 titles
Fidelity
(a) Lexicon (non-standard terms, inappropriate terms)   105
(b) Selection of prepositions                            89
(c) Word construction (noun compounds)                   88
(d) Articles (def., indef.)                              50
(e) Number (sing., plur.)                                43
(f) English terms (technical terms instead of German)    27
(g) Relative clause instead of NP, PP                    19
(h) Focus                                                15
(i) "Unter Verwendung von"                               10
(j) Possessive attribute "von" instead of genitive        8
Grammaticality
(i) Selection of prepositions                            34
(ii) Word compounds                                      23
(iii) Articles                                           14
(iv) Relative clauses                                    12
(v) Focus                                                 9
(vi) Conjunction alignment                                5
(vii) Numbers                                             4
(viii) Attributes                                         3
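The counts in the error classification can be totalled and related to the 300 evaluated titles. The following sketch does only that arithmetic; the dictionary keys are shortened labels of our own, and the per-title rate is a derived figure that the evaluation study does not report.

```python
# Error counts among 300 titles, copied from the classification.
fidelity_errors = {
    "lexicon": 105, "prepositions": 89, "noun compounds": 88,
    "articles": 50, "number": 43, "English terms": 27,
    "relative clause": 19, "focus": 15, "unter Verwendung von": 10,
    "von instead of genitive": 8,
}
grammaticality_errors = {
    "prepositions": 34, "word compounds": 23, "articles": 14,
    "relative clauses": 12, "focus": 9, "conjunction alignment": 5,
    "numbers": 4, "attributes": 3,
}
TITLES = 300

def summarize(errors):
    """Return total count, errors per title, and each class's share of the total."""
    total = sum(errors.values())
    shares = {name: count / total for name, count in errors.items()}
    return total, total / TITLES, shares

for label, errors in (("Fidelity", fidelity_errors),
                      ("Grammaticality", grammaticality_errors)):
    total, per_title, shares = summarize(errors)
    top = max(shares, key=shares.get)
    print(f"{label}: {total} errors, {per_title:.2f} per title, "
          f"largest class: {top} ({shares[top]:.0%})")
```

The totals are 454 entries in the first list and 104 in the second, so a single title can evidently contribute several errors to the classification.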
The above classification indicates that the dictionary
problem cannot be solved in the short term. In our approach
in particular, a semantic symbol generally corresponds to an
upper concept, under which an appropriate German term is
registered as a specialization. The selection of terminology
within a lexical entry is therefore made indirectly, through
its context. Again, this very advantage of freedom of
expression can cause a bad selection of the target word. We
need time to polish our semantic German terminology data
base so that the system selects the right German words in
general.
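The idea of an upper concept with German terms registered as specializations, selected through context, can be sketched as follows. This is an illustration under assumptions, not the SEMSYN dictionary: the German terms are those listed for USE.ACT in the note later in this section, whereas the structure `LEXICON`, the function `select_term` and the context features `method` and `instrument` are hypothetical.

```python
# Hypothetical lexical entry: an upper concept with German specializations.
# Each specialization names the context feature it requires (None = default).
LEXICON = {
    "USE.ACT": [
        ("anwenden", "method"),            # assumed feature name
        ("mit Hilfe von", "instrument"),   # assumed feature name
        ("verwenden", None),               # fallback when no feature matches
    ],
}

def select_term(symbol, context):
    """Return the first specialization whose required feature occurs in context."""
    for term, required in LEXICON[symbol]:
        if required is None or required in context:
            return term
    raise LookupError(f"no German term registered for {symbol}")

print(select_term("USE.ACT", {"method"}))
print(select_term("USE.ACT", set()))
```

The sketch also shows where a bad selection can come from: if the context features are missing or misleading, the fallback term is chosen even though a more specific term would be appropriate.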
The noun compound is a specific problem in German. By
constructing a noun compound a stylistic problem may be
solved elegantly (cf. E.g.9, 10), because otherwise the
use of modifiers (possessive attributes, qualifiers,
quantifiers, etc.) results in awful expressions that
cannot be compared with a sequence of English terms.
We also found a conflict in connection with the selection
of technical terms. While we, as CS experts, preferred
common English technical terms in the field of information
processing for the sake of easy understanding, the
evaluators emphasized the authority of the national standard
technical terms (DIN), e.g. CRT (Datensichtgeraet),
real-time (Echtzeit), etc.
The reason why the German idiom "unter Verwendung von"
was used so frequently can be attributed to the semantic
symbol "USE.ACT", which is often inferred (about 10%) by the
analysis system. (Note: USE.ACT covers "verwenden (use)",
"anwenden (apply)", "Gebrauch machen (make use of)", but
also the <instrument> arc for "mit (with)", "mit Hilfe von
(with the help of)", etc.) This means that making explicit,
via USE.ACT, a meaning that is only implied in the original
Japanese text may either elucidate the situation in German
(this is often the case) or make the expression harder to
understand. By the same token, a postpositional phrase or
adjective phrase of an original text may be awkwardly
expressed as a German relative clause. As the modifier and
USE.ACT cases mentioned above exemplify, over-analysis and
over-expression are specific to our semantic based approach
and could be avoided in other, transfer based approaches.
IV. Conclusion 
We discussed some problems of our semantic based approach.
Many of them are also common to other approaches. However,
our approach seems to be open to continuous improvement
in dealing with these problems.
We express our sincere thanks to the ATLAS/II group
of Fujitsu Laboratory, Kawasaki, for making semantic
representations available for our generation.
References
[1] Laubsch, J., Roesner, D., Lesniewski, A., Hanakata, K.:
"Language generation from conceptual structure: Synthesis
of German in a Japanese/German MT project", in COLING-84,
Stanford, 1984

[2] Roesner, D., Hanakata, K.: "When Mariko talks to
Siegfried", submitted to COLING-85, Bonn, 1985

[3] Emele, M., Momma, St.: "SUTRA-S: Erweiterungen eines
Generator-Front-Ends fuer das SEMSYN Projekt",
Studienarbeit, Inst. f. Informatik, Univ. Stuttgart, 1985

[4] Busemann, S.: "Oberflaechentransformationen bei der
automatischen Generierung geschriebener deutscher Sprache",
Diplomarbeit, Univ. Hamburg, Fachb. Informatik, 1983
