Rapid Development of Translation Tools: 
Application to Persian and Turkish 
Jan W. Amtrup~ Karine Megerdoomian and Rdmi Zajac 
Computing Research Lab 
New Mexico State University 
{j amtrup, kar ±ne, zaj ac}Ocrl, nmsu. edu 
Abstract 
The Computing Research laboratory (CRL) is devel- 
oping a machine translation toolkit that allows rapid 
deployment of translation capabilities. This toolkit 
has been used to develop several machine translation 
systems, including a Persian-English and a Turkish- 
English system, which will be demonstrated. We 
present the architecture of these systems as well as 
the development methodology. 
1 Introduction 
At CRL, one of the major research topics is the de- 
velopment and deployment of machine translation 
systems for low-density s in a short amount 
of time. As the availability of knowledge sources 
suitable for automatic processing in those s 
(e.g. Persian, Turkish, Serbo-Croatian) is usually 
scarce, the systems developed have to assist the ac- 
quisition process in an incremental fashion, starting 
out fl'om low-level translation on a word-fo>word 
basis and gradually extending to the incorporation 
of syntactic and world knowledge. 
The tasks and requirements for a machine trans- 
lation enviromnent that supports linguists with the 
necessary tools to develop and debug increasingly 
complex knowledge about a specific  in- 
elude: 
• The development of a bilingual dictionary that 
is used for initial basic translation and can fur- 
ther be utilized in the more complex translation 
system stages. 
• Methods to describe and process morpholog- 
ically rich s, either by integrating 
already existing processors or by developing 
a morphologicM processor within the system 
framework. 
• Glossing a text to ensure the correctness of mor- 
phological analysis and the colnpleteness of the 
dictionary for a given corpus. 
• Processors and grammar development tools for 
the syntactic analysis of the source . 
• In order to allow rapid development cycles, the 
translation system itself has to be reasonably 
fast and provide the user with a rich envi- 
ronment for debugging of linguistic data and 
knowledge. 
• The system used for development must be con- 
figurable for a large variety of tasks that emerge 
during the development process. 
We have developed a component-based transla- 
tion system that meets all of the criteria men- 
tioned above. In the next section, we will describe 
the m:chitecture of the system MEAT (Multilingual 
Environment for Advanced Trmlslations) , which is 
used to translate between a number of s 
(Persian, Turkish, Arabic, Japanese, Korean, Rus- 
sian, Serbo-Croatian and Spanish) and English. In 
the following sections, we will describe the general 
development cycle for a new , going into 
nmre detail for two such s, Persian and 
Turkish. 
2 General architecture 
MEAT is a publicly available environmeut I that as- 
sists a linguist in rapidly developing a machine trans- 
lation system, in order to keep the overhead in- 
volved in learning and using the system as low as 
possible, the linguist uses use simple yet powerful ba- 
sic data and control structures. These structures are 
oriented towards contemporary linguistic and con> 
putationM linguistic theories. 
In MEAT, linguistic knowledge is entirely repre- 
sented using Typed Feature Structures (TFS) (Car- 
penter, 1992; Zajac, 1992), the most widely used rep- 
resentational formalism today. We developed a fast 
implementation of Typed Feature Structm:es with 
appropriateness, based on an abstract machine view 
(eft Carpenter and Qu (1995), Wintner and Francez 
(1995). Bilingual dictionary entries as well as all 
kinds of rules (morphology, syntax, transfer, gener- 
ation) are expressed as feature structures of specific 
types, so only one description  has to be 
mastered. This usually leads to a rapid familiarity 
with the system, yielding a high productivity almost 
from the start. 
i ht tp ://crl.nmsu. edu/~j aratrup/Meat 
982 
Runtime linguistic objects (words, syntactic struc- 
tures etc.) are stored in a central data struct.ure. 
We use an extension of the well-known concept of 
a chart (Kay, 1973) to hold all temporary and fi- 
nal results. As nmltiple components have to process 
different origins of data, the chart is equipped with 
several different layers, each of which denotes a spe- 
cific aspect of processing. Thus, at every point dur- 
ing the runtime of the system, the contents of the 
chart reflect what operations have been performed 
so far (Amtrup, 1999). The complete chart is avail- 
able to the user using a graphical interface. This 
chart browser can be used to exactly trace why a 
specific solution was produced, a significant aid in 
developing and debugging grammars. 
MEAT addresses the necessity of carrying out sev- 
eral different tasks by providing a component-based 
architecture. The core of the system consists of the 
formalism and the chart data representation. All 
processing components arc implemented in the form 
of plug-ins, components that obey a small interface 
to comumnicate with the main application. The 
choice of which components to apply to an actual 
input, the order in which the components are pro- 
cessed, and individual parameters for components 
can be specified by the user, allowing for a highly 
flexible way of configuring a machine translation (or 
word-lookup, of glossing etc.) system (el. Amtrup 
et al. (2000) for a more detailed description of the 
architecture). 
The MEAT system is completely implemented in 
G++, resulting in a relatively tast mode of op-- 
eration. The implenrentation of the TFS formal- 
ism supports between 3000 and 4500 unifications 
per second, depending on the application it is used 
in. Translating an average length sentence (20-25 
words) takes about 3.5 seconds on a Pentium PII400 (in 
non-optimized debug mode). The system sup.- 
ports Unix (tested on Solaris and Linux) and Win-- 
dows95/98/NT. We use Unicode to represent charac- 
ter data, as we face translations of several different, 
non-European s with a variety of scripts. 
3 Development cycle 
One of the main requirement facilitating the deploy- 
ment of a new  with possibly scarce pre- 
existing resources is the ability to incrementally de- 
velop knowledge sources and translation capability 
(the incremental approach to MT development is de- 
scribed in (Zajac, 1999)). In the case of translation 
sy,~tems at our laboratory, we mostly translate into 
English. Thus, a complete set of English resources is 
already available (dictionary, generation grammars 
and morphological generation) and does not need to 
be developed. 
The first step in bootstrapping a running system 
is to build a bilingual dictionary. The work on the 
dictionary usually continues throughout the devel.- 
opment process of higher level knowledge sources. 
We use dictionaries where entries are encoded as flat 
feature-value pairs, as shown in Figure 1. 
$Headword OJ lyb hftglnh 
$Category Noun 
$Number Plural 
$Regular False 
$English Seven Wonder<nc plur>; 
$$ 
Figure h A Persian--English dictionary entry. 
While this is already enough information to faciL 
irate a basic word-.for-word translation, in general 
a morphological analyzer for the source  is 
needed to translate real-world text. For MEAT, one 
can either import the results of an existing morpho- 
logical analyzer, or use the native description lan- 
guage, based on a finite-state transducer using char- 
acters as left projections and typed feature struc- 
tures as right projections (Zajac, 1998). After com- 
pleting the morphological analysis of the source lan- 
guage and specifying the mapping of lexical features 
to English, glossing is available. The Glosser is an 
MEAT application consisting of morphological anal- 
ysis of source  words, tbllowed by dictionary 
lookup for single words and compounds, and the 
translation into English inflected word forms. An 
example of the interface \[br the glosser is shown in. 
l?igure 2. 
...................... 
File Edit View G0 C0mlaun\[c~2ot Helo 
-: - "7- .... 
~',~3~ t~l £ndonttsia; 
J and; 
~3/z~}J_~ contract j 
~ @rant; ~stit~o; p~.ttir~/; 
&~- to; atl in; on; for; 
)~ )~ East TJ2~O=; 
b .; 
L~I mi~j~at~r~; approval.; slanatu~; 
u~ wine; 
~I~E dig; pull ofl; do; 
~J/i IndonQ=ia; 
J an~ 
3J~i today; t odag; 
&~a;(&~3; (Ve±\] ......... diq; 
pull off; 
Pre~ent; Th/x~; Plural_ 
~c~ive; S~o3~ctive 
1~ unstgned Javn ~pFl~t Whqd~v 
'c-A i .... iAp~a taeat~p~ ;~ r,~,~ '.~ "- -5 "J ~ -2- 
Figure :2: The Glosser interface 
The next step in developing a nmdium-.quality, 
broad coverage translation system is to develop 
983 
.-,C _2227.7_Z:22_C-_~TfT2L27AL--Z:ZZT~:7:hfZ.._~\[_ 2 -,~Nv,~-:c~_ j~ ~-::27 _ ~ ~ T--_ZiLT.\[ Z.::__22 2 7J:.7-.-_L_2_..2Y Y_~:2-__-£ .I.: .4_ 
I 
Q~II% ; 
Vtn'Ilces 
I-dgeS 
161Z 
Sta~. Ima 
i0 ...... 
N~k i 
~-~::-2-L77 ............ 7"-"---,A,/"-->..# / 
_--J --. " -" ,,'>-~ / j~<-~2' 
"-. /,", .->%/" ./..C,-<.~Z--, ~ .. / , ~....,.,. _.,.~.~ 
\ //'. I/ ,'::;G ~ > "';-5-"..'~,~ .... ~1/5 I/' /.'-'/2?" ~51"-'2-.~,-I-~.:--.<<." 
" II I" t/ ':" -" 1/.-~2,./-~ /./~..,. 7 
\ 1' ,/ //.$.-'L>:'7..~;v, .... IK,*;' .'%-,~ - 
\ f/ ~¢,~" ~,h'~ 2 ...... ~,¢7:-.. \ 
bylqYy Iz sh mD l)y'~ 
~.~'~c'_. -~.~_~ .... :--. "-.. ;.q<*j -.. "-, %,. ~.* 
....... ~" 7-Yb£:--'" ...... :-v~<~, .... "-3,.',q../" 
_ _~,~7<---7..2;.=~ ......... ~ ~-..~. ~_. ,zv%,>,y. x'q.,,\ ..... -~- - 
~ \ ,x~*.~ J~" --.. ,, ...... ---.-¢.. v:>:< "-~--'>. ";~.y" _ 
Ifzly"s ^slIydy ylfth Is/ v I 
83 gg g9 ?3 ?7 6'~ fib g6 100 lilt I'" 
Cam lamt: Sentence \[ ..... i- 
........... l:\[aer.l~.~M~lra~ ounaTZeel \[Ad\]ecti'3e" \[ I \[ \[Pre m~lta0nA. \]~'\]N' \] ICPoIIIP \[ \[ \[ \[ \[~lu~er al \]l'!u~' I Nu:'~ \] $"":"VerhlUPlS ", \[NounlN' INPo \[ \[ \[Noun\]If' I NP0l,'ct; ~IPPIXIXP \[\[\[\[M l' i 
~ :lmr.l:~le ..~e rdetg: e he~d VerbPhraae I 
a~er.rml~.~nm,~ head f:nt~y( ~ "' form g,,~m I 
4:per.R111e. Se.te|i¢~ ;~orph Ve ;.bllo i:pholo~-i \[ 
,~ ,~, \[blle. ,~e,~ |~Ice lex Ve rha\]J, ox~call 
\[i:laa i*. FRIIIt. ~.!~, telll?~ ~ poa 9eeb. 
7:ger.Rule.~e.tei*cei p~,::)entS te~. "ylh% regul.a," T~uel. 
II:p~ r. l'Illl~l. ~n ll~llrA~ infl Ve~bal.lnf kect\].on\[ 
,~: NF. ~t1111. ~\[~n ~t II g ~ voice Undeflned, 
tlhper.Rtlleo~len~m;e clztzc : C?.Jtic\[ function : Nulll, 
11:~.Rtde.8~It~I1C8 tense ' Particlp1~L 
IZ:per.RtdmS6~ten~s I causatlw False, I 
1:J--el' ~UlI~ ~lall|/eligli~ i , person Unde~:tned. , \[ . 4' • ' ~ood Un(~f~.ned. { 
14:lll~¢.Rt~e'Se"~ttce \[ nu~erAg~eement Undefined, 
nege, tlon False\] l, 
, og~ Or rhography ! 
key : "ylftn", * 
!~ \[:--~ Li ................................................... ~ .... 
............... ........................ ....... :.::: ::::_::. :: .:: :-=._: ....... : .... -: =- :_-.::_ .:_ :. ::_ '::-::-_ _:. __: :_;: .:_._ _-_._=>-:_- 2::J 
Figure 3: Viewing a eomple:~c aual&,;is 
knowledge sources for the stt°uctural analysis of in- 
put sentences. Mt';AT supports the use of modm 
lar mfific.ation grammars, which facilitate,," develop. 
merit and debugging. E;ach grammar module ca,~. be 
developed and tested irJ isolation, the final system 
applying each gtammar in a linear fashion (Zajac 
and Amtrup, 2000). The main. component used is a 
bidirectional island-.parser (cf. Stock et al. (1988)) 
for unification-based gra.mmars. The grammar rules 
are usually writtet~ in the style of context-free rules 
with \[~ssociated unification constraints as shown i~ 
Figure 4. The rules allow for the specification of 
the right-hand side as a regular-.expression of feature 
structures. We plan to add more restricted types of 
grammars (e.g. based on finite-.state transducers) to 
give the linguist a richer choice of syntactic processes 
to choose from. 
For the time being, the transfer capabilities of the 
system are restricted to lexical transDr, as we have 
not finished the implementation of a complex trans- 
fer module. Thus, the grammar developer either 
needs to create structural descriptions that match 
the English generation, or the English generation 
grammar has to be modified for each . 
At each point during the development of a trans-. 
Adj ~ ::: 'i;u:c,thdLc; \[ 
Jim : i;u~:', g djBa~: \[ 
\[\[c:x:}.!eAd : #Ld\]j 
g}pe{.:: :'~adv\] , 
#adv: : tl**. &dvg~Et;ry "Y" 
#ad.j'::: i:.u:~.', gdjEnt:!:y 
:> 
\]; 
Figm'e 4: A Turldsh syntax rule 
lation syste.m, we consider it essential to be ahle 
not ouly to see the resuRs, but also to monitor 
the processing history of a result. Thus, th.e chart 
that leads 1;o the construction of all English olttptlt 
can be viewed in order to examine all intermediate 
constructions. In the MEAT system, each module 
records various steps of computations in the chart 
which can be inspected statically after processing. A 
unified data interface for all modules in the systein 
allows both the inspection of recorded internal data 
structures for each module (when it makes sen:-;e, 
984 
such aq ill a ch.art parser), aml tile im;pecticm of tit(.' 
input/ou£put of all module,';. The graphical i~terfa.ce 
used to view complex a.t~alyscs i.r; shown in Figure 3. 
d A.pp\]J.ca{;ions 
In this section, we give an overview of the capal)i!i. 
ti(~'; of MEA71' using ~wo (-l!ri{~.nt examples from work 
af, our laboratory. In the Shiraz project ? (Amtrup 
ei; al., 2000), we developed a machine translatio~ 
system from Farsi tO English, for which no previous 
knowledge sources were awdlable. We mainly target 
news material and the transl~tion of web pages. Tile 
Tm'kish~English system has been developed with the 
Expedition project a, an enterprise for the rapid de- 
velopment of MT systems for low-density s. 
Botl/systems use a common user interface for ac~ 
cess to MEAT, which is shown in Figure 5. The 
MT systems are targeted t() the translation of newsy. 
paper text and other sources available online (e.g. 
web pages). The emphasis is therefore put ou exten+ 
sive coverage rather than very-+high quality transta+ 
tion+ Currently, we reach for bol;h systmns a level 
of quality that allows to assess in detail tile content 
of source texts, at the expense of some rmfelicitous 
English. 
r~ ta~ 
............. D vc~lv~ {ll;t,~c s:/Im~oa~tj am~ In Iplr,eanhltm-! C fl t g'~A. t..R ~d~q~ ~f~l ,+01. bl t 
............... i¢i ~ZL2Z -2-~£2-£.27 ~-+ ~2Z7-~ -_._ _L_,7 
................ T~ak~tma 
c~,~ :\[iT~;i-;;,a;;~Eh Tg~GiTg~'TggU~..,.-gJ~ g g;gTJE.TKg7 \[y 
T7:2~: 77~7~7 : I}IC~W: CriaX~ =true\]( ocon0~, 
Dm ( rL~Rhlh ) I I 
'' ~ " ~t" j C ................................................................................... \[ : . 
~-~ ............................. s~,2~, ~,~ ..... 
~ht~l~llt~a }lONer gx~cutin 9 llodu\[~ cll~rgS~v~r +~ 
........ ~5:___.::~ ...... :\[chartsauer Size-ag, re~xde~t-S~, toCah,.O, C~U.,0 010a ' • ht~l-~ IRt |~ Executin~ Nodule su'fce I 
\[c+ Sueface0ener~nor Startinq t~ 1 +OZ L~t s 
t~tl-O3,lxt ": : S~tcfGcn:Ur f~ce0onor~orsiz~+0E ~=~tdeltt=~K'znL~h~d" to~cl-0, ~pII-00{\]0s .J 
hl~*1-0/'bct I r }Tot~\]." g~ze=S2R, re~id~nt-48g, toeal~99800 C~J=2 3~0s (croat ~/ \]Iu,-.+; =~ , i ~ .................................... 
Figure 5: The MEAT user interface 
4,1 Persian..+Englisb MT 
The input for the Persian-English system is usu- 
ally taken from web pages (on-line news articles), 
although plain text can be handled as well. The im 
ternal encoding is Unicode, and various codeset con- 
w.'rters are available; we also developed an ASCII- 
based transliteration to facilitate the easy acquisi~ 
tion of dictionaries and grammars (see Figure 1). 
The dictionary consists of approximately 50,000 en+ 
tries, single words as well as multi-word compounds. 
Additionally, we utilize a multi-lingual onomasticon 
maintained locally to identify proper names. 
2http ://crl. ransu, edu/shiraz 
3http ://crl. nmsu. edu/expedit ion 
we developed a, morphological grammar for Per- 
siai~ using a unifies.lion-based formalism (Zajac, 
1998). A sample rule is showtt in Figure 6 (of. 
Megerdoomian (2000) for a more thorough descrip 
~ion of the Persian morphological analyzer). 
PresentStem = < 
Regul arPx'e sent St em 
< < <"In" "y"?> 
per. Verbal\[inf\]..causative: Trne\]> I 
< per.Verba.\]o\[infl.causative: False\]> 
>>; 
Figu.re 6: A Persian morphological rule 
The knowledge sources for syntax were (manually) 
developed using a corpus of 3,000 tagged and brack- 
eted setrt, ences extracted h:om a 10MB corpus of Per.° 
sian news articles. We use three grammars, respon.- 
sible lor the attaetunent of auxiliaries to main verbs, 
the recognition and processing of light verb phenom-. 
ena, and phrasal and sentential syntax, respectively. 
The combined size is about 110 rules. The develop. 
ment of the Persian resources took several months, 
primarily due to the fact that the translation system 
was developed in parallel to the linguistic knowl.. 
edge. The Persian resources were developed by a 
team of one computational linguist (morphological 
and syntactic grammars, overall supervision ibr lan-- 
guage resources), and 3 lexicographers (dictionary 
and corpus annotation). 
4°2 'lti~urkish+-English MT 
Withii~ yet another project (Expedition), we devel- 
oped a machine translation system, for Turkish. This 
application functioned as a benchnrark on how much 
effort the building of a medium-quality system re-- 
quires, given that an appropriate framework is al-. 
ready available. 
For Turkish, we use a pre-existing morphologi- 
cal analyzer (Oflazer, 1994). Turkish shows a rich 
derivational and inflectional morphology, which ac- 
counts for most of the system development work 
that was necessary to build a wrapper for inte- 
grating the Turkish morphological analyzer in the 
system (approximatly 60 person-hours) 4. The de- 
velopment of the Turkish syntactic grammars took 
around 100 person-hours, resulting in 85 unification- 
based phrase structure rules describing the basics of 
Turkish syntax. The development of the bilingual 
Turkish-English dictionary had been going on for 
4The main problem during the adaptation was the treat- 
ment of derivational information, where the morphologically 
analyzed input does not conform exactly to the contents of 
the dictionary 
985 
some time prior to the application of MEAT, and 
currently contains approximately 43,000 headwords. 
5 Conclusion 
The development of automatic machine translation 
systems for several s is a complicated task, 
even more so if it has to be done in a short amount 
of time. We have shown how the availability of a 
machine translation environment based on contem- 
porary computational linguistic theories and using 
a sound system design aids a linguist in building 
a complete system, starting out with relatively siln- 
ple tasks as word-for-word translation and increlnen- 
tally increasing the complexity and capabilities of 
the system. Using the current library of modules, it 
is possible to achieve a level of quality on par with 
the best transfer-based MT systems. Some of the 
strong points of the system are: (1) The system can 
be used by a linguist with reasonable knowledge of 
computational linguistics and does not require spe- 
cific programming skills. (2) It can be easily con- 
figured for building a variety of applications, includ- 
ing a complete MT system, by reusing a library of 
generic modular components. (3) It supports an in- 
crelnental develol)ment methodology which allows to 
develop a system in a step-wise fashion and enables 
to deliver rumfing systems early in the development 
cycle. (4) Based on the experiences in building the 
MT systems mentioned in the paper, we estimate 
that a team of one linguist and three lexicographers 
can build a basic transfer-based MT system with 
medium-size coverage (dictionary of 50,000 head- 
words, all most fi'equent syntactic constructions in 
the ) in less than a year. 
One major improve,nent of the system would be 
a nmre integrated test and debug cycle linked to a 
corpus and to a database of test items. Although ex- 
isting testing methodologies and tools could be used 
(l)auphin E., 1996; Judith Klein and Wegst, 1998), 
building test sets is a rather time-consulning task 
and some new approach to testing supporting rapid 
development of MT systems with an emphasis on 
wide coverage would be needed. 

References 

Jan W. Amtrup, Hmnid Mansouri Rad, Karine 
Megerdoomian, and Rdmi Zajac. 2000. Persian- 
English Machine Translation: An Overview of 
the Shiraz Project. Memoranda in Computer and 
Cognitive Science MCCS-00-319, Colnputing Re- 
search Laboratory, New Mexico State University, 
Las Cruces, NM, April. 

Jan W. Amtrup. 1999. Incrcmcntal Speech TraTtsla- 
tion. Nmnber 1735 in Lecture Notes in Artificial 
Intelligence. Springer Verlag, Berlin, Heidelberg, 
New York. 

Bob Carpenter and Yan Qu. 1995. An Abstract 
Machine for Attribute-Value Logics. In Proceed- 
ings of the ~th International Workshop on Pars- 
ing Technologies (II/VPT95), pages 59-70, Prague. 
Charles University. 

Bob Carpenter. 1992. The Logic of 2~ped Feature 
Structures. Tracts in Theoretical Computer Sci- 
ence. Cambridge University Press, Cambridge. 

Lux V. Dauphin E. 1996. Corpus-Based annotated 
Test Set for Machine Translation Evaluation by 
an Industrial User. In Proceedings of the 16th 
International Conference on Computational Lin- 
guistics, COLING-96, pages 1061-1065, Center 
for Sprogteknologi, Copenhagell, Demnark, Au- 
gust 5-9. 

Klaus Netter Judith Klein, Sabine Lehmann and 
Tillman Wegst. 1998. DIET in the context of MT 
evaluation. In Proceedings of KONVENS-98. 

Martin Kay. 1973. The MIND System. In. 
R. Rustin, editor, Natural Language Processing, 
pages 155-188. Algorithmic Press, New York. 

Karine Megerdoomian. 2000. Unification-Based 
Persian Morphology. In Proceedings of the CI- 
Cling 2000, Mexico City, Mexico, February. 
Kemal Oflazer. 1994. Two-level Description of 
Tnrkish MorI)hology. Literary and Linguistic 
Computing, 9(2). 

Oliviero Stock, Rino Falcone, and Patrizia Insin- 
namo. 1988. Island Parsing and Bidirectional 
Charts. In Proc. of the 12 t1~ COLING, pages 636- 
641, Budapest, Hungary, August. 

Shuly Wintner and Nissim Francez. 1995. Pars- 
ing with Typed Feature Structures. In Proceed- 
ings of the ~th International Workshop on Parsing 
Technologies (\]WPT95), pages 273 287, Prague, 
September. Charles University. 

Rdmi Zajac and Jan W. Amtrup. 2000. Modular 
Unification-Based Parsers. In Proc. Sizth lr~,tcrna- 
tional Workshop on Parsing Technologies, trento, 
Italy, February. 

Rdmi Zajac. 1992. Inheritance and Constraint- 
Based Grammar FormMislns. Computational Lin- 
guistics, 18(2):159-182. 

R&mi Zajac. 1998. Feature Structures, Unification 
and Finite-State Transducers. In FSMNLP'98, 
Intcrnational Workshop on Finite State Methods 
in Natural Language Processing, Ankara, ~lhlrkey, 
June. 

Remi Zajac. 1999. A Multilevel Frmnework for In- 
cremental Development of MT Systems. In Ma- 
chine Translation Summit VII, pages 131-157, 
University of Singapore, September 1.3-17. 
