English - Malay Translation System : A Laboratory Prototype 
TONG Loong-Cheong 
Computer Aided Translation Project 
School of Mathematical and Computer Sciences 
Universiti Sains Malaysia 
11800 Penang, MALAYSIA 
Abstract 
This paper presents the results obtained by an English to
Malay computer translation system at the level of a laboratory
prototype. The translation output obtained for a selected text
(secondary school Chemistry textbook) is evaluated using a
grading scheme based on ease of post-editing. The effect of a
change in area and typology of text is investigated by comparing
with the translation output obtained for a University level
Computer Science text. An analysis of the problems which give
rise to incorrect translations is discussed. This paper also
provides statistical information on the English to Malay
translation system and concludes with an outline of further work
being carried out on this system with the aim of attaining an
industrial prototype.
1. The English to Malay Translation System
Background
Computer Aided Translation (CAT) research at Universiti
Sains Malaysia (USM) began in 1976 as an individual research
effort. However, at that time, the work is more appropriately
classified under natural language data processing, including
topics such as 'istilah' (terminology) information retrieval,
Malay root form extraction, parsing of Malay sentences using
context-free grammars and Malay language teaching tools [Tong 78,
Chang 78].
In 1978, research into CAT was initiated, and by 1979, the
researchers at USM began to develop grammar models for English to
Malay translation using the software tool ARIANE [GETA 78].
In 1980, a national workshop was conducted in USM, where a
pilot English to Malay translation system was demonstrated.
Financial support became available, and further development on
the basic translation model was carried out [Tong 82, van Klinken
84, Zaharin 84].
In 1984, a permanent Computer-Aided-Translation Project unit
was set up at USM, and full-time research staff were assigned to
this project. Members of this project group now include two
computer scientists, one linguist, and five lexicographers /
editors / terminologists. This group was assigned the task of
producing a laboratory prototype for English to Malay
translation, and the result of their efforts is presented in this
report.
System Environment
The ARIANE system is an integrated software environment for
computer-aided-translation, including tools for compiling
grammars and dictionaries, and for processing corpora of the
source and target texts. The CAT concepts behind this system are
well-known and well-documented [Boitet and Vauquois 85].
This software has been programmed using different levels of
computer languages, from IBM assembly (PL360) to PL/I, and making
extensive use of system tools of the IBM VM/CMS system - XEDIT
and EXEC. One of its advantages is efficiency (as compared to
other similar systems), which means that it can execute with
reasonable speed even on a comparatively small computer system.
USM's experience with the ARIANE system has been very
satisfactory, and we doubt very much if another system could have
been migrated and utilised at this University with similar
success. Although there had been some criticisms about ARIANE in
the literature, our experience has shown that in spite of its
recognised weaknesses and drawbacks, it remains an extremely
powerful and practical set of tools for the development of CAT
systems. Of course, the methodology pioneered at GETA [Vauquois
75] has been incorporated into many 'new' systems today.
On the physical side, the ARIANE system itself occupies
about 8 Mbyte of secondary storage, while the user machine
requires another 5 Mbyte for storing the linguistic data (grammar
models and dictionaries, but not including the source and target
texts and their intermediate results). A virtual memory size of
2 Mbytes is used for the execution of all the translations from
English to Malay described in this report.
Translation Model and Execution Time
The English to Malay translation system consists of three
main dictionaries - source English, English-Malay transfer,
target Malay - and five grammar models. The sizes of these
various components are as follows:
Dictionaries:
    Source lexicals:  5,000   (11,000 words)
    Target lexicals:  4,000   ( 9,000 words)
Grammar models:
                                lines   rules
    morphological analysis        600      90
    structural analysis          5600     300
    structural transfer           800      47
    structural generation        1700     120
    morphological generation      900     120
The execution time for translation is estimated at 1.0097
Mipw (million of instructions per word). This is consistent with
times measured at GETA, Grenoble [Boitet and Vauquois 85]. In
practical terms, this means that on USM's IBM 4381 system
(estimated at 2.1 MIPS), the translation time is approximately
0.48 second of virtual CPU time per word. This figure is based
on the translation time for about 3,000 words taken from the
selected text. The proportionate time for each phase of the
translation process is as follows:
                                percent
    morphological analysis         0.33
    structural analysis           55.21
    lexical transfer               0.44
    structural transfer           11.34
    structural generation         31.47
    morphological generation       1.21
From the above, it can be seen that the three dictionary
retrieval phases together account for only 2 % of the time, while
the structural analysis phase used up more than half the total
time, with the rest taken up by the structural generation (about
one-third) and the structural transfer phases. This result is
again consistent with those for other translation models at GETA,
Grenoble.
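
The timing figures above can be checked with a few lines of
arithmetic. The sketch below is only an illustration: the names
are hypothetical, and the grouping of morphological analysis,
lexical transfer and morphological generation as the "three
dictionary retrieval phases" is an assumption inferred from the
percentages, not something the ARIANE documentation is quoted as
saying here.

```python
# Figures quoted in the text; all identifiers are illustrative.
MIPW = 1.0097   # million instructions per translated word
MIPS = 2.1      # estimated speed of the IBM 4381, million instructions/second

# Percent of translation time per phase, copied from the table above.
PHASE_SHARE = {
    "morphological analysis": 0.33,
    "structural analysis": 55.21,
    "lexical transfer": 0.44,
    "structural transfer": 11.34,
    "structural generation": 31.47,
    "morphological generation": 1.21,
}

# Assumption: these are the three dictionary retrieval phases.
DICTIONARY_PHASES = (
    "morphological analysis",
    "lexical transfer",
    "morphological generation",
)


def seconds_per_word(mipw: float, mips: float) -> float:
    """CPU seconds per word = instructions per word / instructions per second."""
    return mipw / mips


def dictionary_share(shares: dict) -> float:
    """Combined percentage of time spent in the dictionary phases."""
    return sum(shares[phase] for phase in DICTIONARY_PHASES)


if __name__ == "__main__":
    print(f"{seconds_per_word(MIPW, MIPS):.2f} s of CPU time per word")
    print(f"{dictionary_share(PHASE_SHARE):.2f} % in dictionary phases")
```

Dividing 1.0097 by 2.1 reproduces the quoted 0.48 second per
word, and the three small phases sum to just under 2 %.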
2. The Quality of Translation
Grading Scheme
In order to assess the 'quality' of the translation output,
a grading scheme (from grade A to grade F) was devised using a
sentence as the boundary of assessment. This scheme is based on
the ease of post-editing the translation output, and not on the
quality or standard of translation in the more usual sense.
Currently, there is no established method of evaluating
computer-aided-translation or mechanical translation output.
Ease of post-editing is a measure which also takes into account
the ease of understanding as well as the accuracy of translation.
Two important factors which affect any grading scheme are the
typology of the source text itself and the expert knowledge of
the evaluator in that particular area of text. Some method of
evaluating the ease of understanding of the source text and some
definition of a neutral evaluator are prerequisites to any
standardised evaluation scheme.
The grading scheme proposed in this report is a measure of
the time required to edit sentences translated by the computer,
ranging from fast (as in grade A where no post-editing is
required) to slow (as in grade F where a sentence has to be
retranslated manually). There has been no attempt to categorise
the source sentences into different degrees of difficulty or
length. Hence, the typology of text used in this evaluation must
be borne in mind when assessing the overall results. Although
grades are assigned to individual sentences, the source texts
were extracted by paragraphs, and hence, the continuity of the
text is maintained. The actual grading itself was carried out by
more than one individual in order to reduce (as much as possible)
the effect of individual 'bias'. After careful scrutiny, it was
concluded that variation in the results obtained is within
expected limits, thus allowing broad conclusions to be drawn on
the effectiveness/usefulness of the translation system.
The grades assigned to translated sentences are as follows: 
A: perfect translation, no modification required.
B: list of alternative words selected by post-editor.
C: understandable translation (with preservation of
meaning), single word corrections without reference to
source text.
D: as in C, but reference to source text is necessary.
E: major modifications with reference to source text.
F: retranslated manually.
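
For readers who want to tabulate grades programmatically, the
scale above can be written down as a small data structure. This
is a minimal sketch: the class and function names are
hypothetical, the descriptions are paraphrases of the list above,
and treating grades D to F as the ones needing the source text is
read off those descriptions rather than stated as a rule by the
project.

```python
from enum import Enum


class Grade(Enum):
    """Post-editing grades, paraphrased from the list above."""
    A = "no modification required"
    B = "post-editor selects from a list of alternative words"
    C = "single-word corrections, source text not needed"
    D = "single-word corrections, source text needed"
    E = "major modifications with reference to source text"
    F = "retranslated manually"


def needs_source_text(grade: Grade) -> bool:
    """True when the post-editor has to consult the English source."""
    return grade in {Grade.D, Grade.E, Grade.F}


def effort_rank(grade: Grade) -> int:
    """0 for the fastest edit (A) up to 5 for the slowest (F)."""
    return list(Grade).index(grade)
```

Ordering the members from A to F keeps the "fast to slow" reading
of the scale available as a simple integer rank.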
Results for Selected Area and Text 
A Chemistry textbook for upper secondary school was chosen
as the first text for the development of the laboratory
prototype. A total of 393 sentences were extracted at random
from this textbook and translated by the computer. The
translation output is then graded by three human post-editors and
the result given below is based on their combined evaluation.
Grade:               A     B     C     D     E     F
No. of sentences    61   125   114    85     8     0
Percentage %        15    32    29    22     2     0
Cumulative %        15    47    76    98   100   100
The above result shows that 76 % of translated sentences are
'understandable' (no reference to English source text is
necessary) and require, at the most, only minor modifications
during post-editing.
Effect of a Change in Area and Typology
The new text is a University level Computer Science
textbook, from which 207 sentences were extracted, translated by
the computer, and then graded. The result is as follows:
Grade:               A     B     C     D     E     F
No. of sentences    23    44    74    41    11    14
Percentage %        11    21    36    20     5     7
Cumulative %        11    32    68    88    93   100
As expected, the quality of translation in this case is
lower than that for the Chemistry text. Most of the additional
problems encountered can be solved either through dictionary
coding or minor modifications in the grammar. With these
changes, the quality of translation for the Computer Science text
is expected to be raised to the same level as that for the
Chemistry text.
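
The cumulative percentages for the two texts can be recomputed
from the sentence counts. The following is a small sketch with
hypothetical names; it uses only the counts reported in the two
tables.

```python
GRADES = "ABCDEF"

# Sentence counts per grade, taken from the two result tables.
CHEMISTRY = dict(zip(GRADES, (61, 125, 114, 85, 8, 0)))
COMPUTER_SCIENCE = dict(zip(GRADES, (23, 44, 74, 41, 11, 14)))


def cumulative_percent(counts: dict, up_to: str) -> float:
    """Percentage of sentences graded `up_to` or better."""
    total = sum(counts.values())
    included = GRADES[: GRADES.index(up_to) + 1]
    return 100.0 * sum(counts[g] for g in included) / total


if __name__ == "__main__":
    for name, counts in (("Chemistry", CHEMISTRY),
                         ("Computer Science", COMPUTER_SCIENCE)):
        share = cumulative_percent(counts, "C")
        print(f"{name}: {share:.1f} % graded C or better")
```

Grades A to C correspond to the 'understandable' sentences, so
the function call with "C" gives the headline figure for each
text.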
3. Existing Problems Classification
An attempt was made to analyse the problems encountered,
i.e. the errors in translation output. This involves a tedious
process of correctly identifying the source of each error found
in the translation output, and then classifying them according to
the phase of translation (i.e. analysis, transfer or generation)
at which they occur. The purpose is to identify simple problems
which can be solved in the existing system through modifications
to the linguistic data, while more complex problems can be the
subject of further research. This analysis of errors also
provides statistical information on their distribution and
importance, hence giving some guidelines as to their priority for
further investigation.
The Analysis Phase
The problems of ambiguity and coordination account for more
than half of the errors at the analysis phase. The problem of
ambiguity here refers to ambiguities which remain unresolved at
the end of analysis and to cases of erroneous disambiguation.
This type of problem is by far the most important, accounting for
close to 50 percent of the existing errors found in the analysis
phase.
Ambiguities which remain unresolved include
verb/noun ( 'forms', 'works', 'use' ),
verb/adjective ( 'direct', 'total' ),
verb/ven ( '.. is unglazed paper..' ),
noun/adjective ( 'routine', 'plural' ),
verb/ving ( '.. painting of...' ),
adj/pronoun ( 'other' ).
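
The category ambiguities listed above can be pictured as words
that carry more than one candidate class in the dictionary. The
sketch below uses a hypothetical toy lexicon built from some of
the quoted examples; it is not the format of the ARIANE source
dictionary and it performs no disambiguation.

```python
# Hypothetical toy lexicon: each word maps to the categories it may take.
TOY_LEXICON = {
    "use": {"verb", "noun"},
    "works": {"verb", "noun"},
    "direct": {"verb", "adjective"},
    "total": {"verb", "adjective"},
    "routine": {"noun", "adjective"},
    "other": {"adjective", "pronoun"},
    "the": {"determiner"},
    "paper": {"noun"},
}


def ambiguous_words(sentence: str, lexicon=TOY_LEXICON) -> dict:
    """Return {word: categories} for words with several candidate categories."""
    result = {}
    for word in sentence.lower().split():
        categories = lexicon.get(word, set())
        if len(categories) > 1:
            result[word] = categories
    return result
```

A call such as ambiguous_words("Use the other paper") flags 'use'
and 'other', which are the words an analysis grammar would then
have to resolve from context.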
Coordination (apposition, inclusion) is a serious structural
problem not handled particularly well by the existing grammar
model. Many different types of elements can participate in
coordination (apposition, inclusion) and examples of cases not
considered in the current grammar are:
complex noun phrases,
prepositions ( 'to and from and within..' ),
verbal clauses ( '...but ..... and .....' ),
interrogatives ( 'why .... and do ....' ),
adjunct phrases ( '..hot and humid..' ).
Other errors in the analysis phase are relatively less
complex and can be solved through modifications or improvements
in the morphological and structural analysis grammars and in the
coding of the source dictionary. Errors in this category are:
- errors in morphological coding, including idiomatic
expressions and compound words;
- structures not provided for in the current model, such as
(elision)
'although large enough to pass through..'
(embedded imperative)
'; hence the instruction: shake the bottle.'
(complex comparative)
'..the same temperature as that at which..',
(enumeration)
'.... only 4 operations:
I/O, arithmetic, comparison, movement of data.'
Various bugs still exist in the analysis grammar model
itself and these will be corrected as part of the maintenance on
the translation system.
The Transfer Phase
The two main problems at the transfer phase are the incomplete
(or incorrect) choice of target lexicals, and the transfer of
idiomatic expressions.
The disambiguation of a source lexical which carries more than
one meaning and which is translated by different target lexicals
accounts for more than half of the errors at transfer. The
source of this problem is actually at the analysis phase, which
was unable to produce a sufficiently deep level of interpretation
(e.g. semantics and semantic relations) to solve the ambiguity
which manifests itself only at transfer.
The two categories of words which are most problematic are
the verbal forms ( 'reveal', 'assume', 'call' ) and the
prepositions ( 'in', 'by', 'to' ). Although disambiguation rules
based on context are employed during the structural transfer
phase, they can only solve relatively straightforward cases. For
the more difficult cases, the current approach of displaying a
list of multiple choices of words to the human post-editor seems
to be the most acceptable solution. Much deeper work in static
semantics and semantic relations will have to be carried out in
order to improve on this. Even if such improvements are found,
there is still the question of weighing the cost of such
sophisticated processing by the computer (which is expected to be
very high) against the cost of human post-editing.
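
The strategy described above, context rules first and a list of
alternatives for the post-editor otherwise, can be sketched as
follows. Everything here is hypothetical: the rule features
('place', 'agent') are invented for the illustration, and the
Malay candidates are common dictionary equivalents chosen as
examples, not the entries of the actual transfer dictionary.

```python
from typing import Dict, List, Optional, Tuple

# Candidate Malay equivalents per English preposition (illustrative only).
ALTERNATIVES: Dict[str, List[str]] = {
    "in": ["di", "dalam"],
    "by": ["oleh", "dengan"],
    "to": ["ke", "kepada"],
}

# Hypothetical context rules: (preposition, feature of governed noun) -> choice.
CONTEXT_RULES: Dict[Tuple[str, str], str] = {
    ("to", "place"): "ke",
    ("by", "agent"): "oleh",
}


def transfer(preposition: str, noun_feature: Optional[str] = None) -> List[str]:
    """Return one target word if a context rule fires, else every candidate."""
    if noun_feature is not None:
        chosen = CONTEXT_RULES.get((preposition, noun_feature))
        if chosen is not None:
            return [chosen]
    # No rule applies: leave the choice to the human post-editor.
    return list(ALTERNATIVES.get(preposition, [preposition]))
```

Returning a list in both cases keeps the interface uniform: a
single element means the rule base decided, several elements mean
the sentence would be graded B at best.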
Idiomatic expressions are normally coded directly in the
source dictionary. Unfortunately, the ARIANE software does not
provide sufficient facilities at analysis or at transfer phase to
cater for some of the complex manipulations required. Some
idiomatic expressions are ambiguous (i.e. they can be considered
idiomatic only in certain contexts), and hence, there is the
problem of disambiguating them during analysis. Also, some
English idiomatic expressions are particularly difficult to
translate into Malay, and perhaps other target languages as well.
The Generation Phase
Errors during structural generation are relatively few, and
also relatively minor from the point of view of post-editing.
Most errors during this phase will give rise to grade C sentences
if there are no other types of errors in the sentence.
The main problems are as follows:
** Position of elements in complex noun phrase.
Most of the errors are due to the incorrect placement of the
preposition 'bagi' (similar to 'of' but not as commonly used) in
a complex Malay noun phrase. Other elements of the noun phrase
which give rise to errors are the '-ing' or '-en' form used as an
adjective, and the lexicals 'other' and 'only' which seem
difficult to translate into Malay. Very often, an adjective in
Malay is introduced by the relative pronoun 'yang'. However,
there seems to be no consistent rule for this. Certain lexicals
always require a 'yang', while others only under certain not
well-defined conditions.
** Position of adverbs and adjuncts of clauses.
This problem is not very well investigated in the existing
model, and can hopefully be improved upon at a later stage.
** Relative clause introduced by a preposition.
The relative clause introduced by a preposition ( 'in which',
'from where', etc. ) is particularly difficult to translate into
Malay (even for human translators). Formal linguistic study is
being carried out into possible target structures. This is one
specific case whereby linguistic research is initiated
specifically to cater for the needs of computer-aided-translation.
** The generation of Malay pronouns.
Another difficult problem is the translation of some
pronouns - 'it', 'they', 'another', 'one', 'latter', 'former',
'those'. The Malay language sometimes requires a repetition of
the referenced object in place of the pronoun. Even when this is
not necessary, as in the case of a pronoun referring to an
undefined object, it may be incorrect to translate directly with
the equivalent pronoun ('ia', 'mereka', 'yang lain', 'kita').
Again further investigation into the linguistic aspects of this
problem will be necessary before an acceptable solution can be
found.
source:   'move from one part of the solid to another'
computer: 'bergerak dari 1 bahagian pepejal kepada yang lain'
edited:   'bergerak dari 1 bahagian pepejal kepada bahagian yang
          lain'
4. Further Work on the Prototype
Grammar Model Development
Many problems remain to be tackled both from the linguistic
as well as the computer science point of view. Some of these
problems, especially at the generation phase, are at the surface
or syntactical level. Further work on the grammar model should
bring about improvements in this area.
The problems of coordination during analysis and lexical
ambiguity during transfer are at a deeper semantics level. Until
formal linguistic work on semantics (such as Montague Grammar)
can come up with some practical solution, these problems are only
amenable to a linguistic engineering approach based on some
static categorisation of semantics together with some generalised
dynamic method of processing and the ability to handle
exceptional cases.
The current English analysis model already contains a very
comprehensive set of disambiguation rules. For the more
difficult cases which still remain unresolved at the end of
analysis, an exhaustive search method can be employed. This is
not as costly as it may seem, since a survey of such cases has
indicated that good heuristic conditions are possible to reduce
the overall search time.
The current analysis model attempts to achieve a deep level
of interpretation right up to logical and semantic relations.
Since this level may not be attainable for many of the sentences
in a particular text, a lower level of interpretation such as
syntactic functions or even morphosyntactic classes should be
used instead. A large proportion of such sentences can still be
translated correctly, and therefore, the provision of this
'safety net' is essential.
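
The 'safety net' amounts to falling back from the deepest level
of interpretation to shallower ones. A minimal sketch of that
fallback, assuming a hypothetical analysis result stored as a
dictionary keyed by level name (the level names come from the
paragraph above; the data layout does not):

```python
from typing import Dict, Optional, Tuple

# Levels of interpretation, deepest first.
LEVELS = (
    "logical and semantic relations",
    "syntactic functions",
    "morphosyntactic classes",
)


def best_available_level(analysis: Dict[str, Optional[object]]) -> Tuple[str, object]:
    """Return the deepest level the analysis managed to produce."""
    for level in LEVELS:
        value = analysis.get(level)
        if value is not None:
            return level, value
    raise ValueError("no interpretation available at any level")


if __name__ == "__main__":
    partial = {
        "logical and semantic relations": None,   # analysis failed here
        "syntactic functions": {"subject": "bottle"},
        "morphosyntactic classes": ["noun"],
    }
    print(best_available_level(partial)[0])
```

Transfer and generation would then work from whichever level is
returned instead of rejecting the sentence outright.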
The development of an industrial prototype will demand a
considerable increase in the size of the dictionary, at least to
about 10,000 source lexical units. Hence, lexicographic work
represents the single most important and time-consuming task in
the development of an industrial prototype. Preparations are
already underway to simplify this task by producing a simplified
form (or questionnaire) which can be filled-up by lexicographers
with perhaps only a minimal amount of training. Data from this
'form' can then be transferred into computer codes to be used by
the translation system.
This preparation of a computerised dictionary can also be
integrated with any work being carried out on lexical databases
for ordinary human consumption. The two tasks have a large
amount of intersecting information needs, and hence, can be
mutually beneficial.
Towards an Industrial Prototype
The laboratory prototype is now ready for development into
an industrial prototype. The first task is of course the drawing
up of a list of possible applications, followed by a feasibility
study of the text typology for each of these applications. The
final selection will be based on the quality of translation which
can be expected and the type of financial support available.
Other important considerations include:
the volume of translation work,
the frequency of translation work,
the urgency / speed of the translation work,
the availability of a complete set of Malay technical terms,
the availability of text materials in machine-readable
format.
Once an application has been selected, the next step is the
organisation of the development work itself. Here, the available
manpower is a critical element, and from experience, it is very
difficult to convince policy makers and financial supporters on
this. Any development team must be made up of high-calibre
computational linguists, computer scientists, lexicographers,
editors and translators, who must be well-trained in the
methodology of computer-aided-translation besides their own area
of specialisation.
Another important factor for planning purposes is the time
required to develop an industrial prototype, and this has also
been frequently underestimated. It is estimated that at least 3
years work by the existing research team at Universiti Sains
Malaysia will be required to complete an industrial prototype for
English - Malay translation in one specific area of application.
A Dedication
Without the late Professor B. Vauquois, the CAT Project at
Universiti Sains Malaysia would not have existed. His dedication
has inspired all who worked with him, and his kindness will
always be remembered.
References 

1. [Boitet and Vauquois 85]
Christian Boitet and Bernard Vauquois
'Automated Translation at GETA'
GETA, Aug 1985.

2. [Chang 78]
Chang May See
'Computer System Aids in Natural Language Data Processing'
M.Sc. Thesis, USM, Oct. 1978.

3. [GETA 78]
M. Quezel-Ambrunaz
'ARIANE 78: Système interactif pour la traduction
automatique multilingue'
Tech. Report GETA, Sep 1978.

4. [Tong 78]
Tong Loong Cheong
'An Information Retrieval System with Linguistic Capability'
Proc SEARCC, Sep 1978.

5. [Tong 82]
Tong Loong Cheong
'Computer Aided Translation - Technical Report Compilation'
Tech. Report PSMK, Dec 1982.

6. [van Klinken 84]
Catharina van Klinken
'Disambiguation Strategy in English Structural Analysis'
Tech. Report PSMK, Dec 1984.

7. [Vauquois 75]
Bernard Vauquois
'La traduction automatique à Grenoble'
Documents de linguistique quantitative,
DUNOD, 1975.

8. [Zaharin 84]
Zaharin Yusof
'The Morphological Generation of Malay'
Tech. Report GETA, Oct. 1984.
