Maltilex: A Computational Lexicon for Maltese 
M. Rosner, J. Caruana and R. Fabri 
University of Malta, Msida MSD06, Malta 
mros@cs.um.edu.mt, jcar1@um.edu.mt, rfab1@um.edu.mt
Abstract 
The project described in this paper, which is 
still in the preliminary phase, concerns the de- 
sign and implementation of a computational 
lexicon for Maltese, a language very much in 
current use but so far lacking most of the in- 
frastructure required for NLP. One of the main 
characteristics of Maltese, a source of many dif- 
ficulties, is that it is an amalgam of different 
language types (chiefly Semitic and Romance), 
as illustrated in the first part of the paper. The 
latter part of the paper describes our general 
approach to the problem of constructing the lex- 
icon. 
1 Introduction 
With few exceptions (e.g. Galea (1996)) Mal- 
tese is pretty much virgin territory as far as 
language processing is concerned, and therefore 
one question worth asking is: where to begin? 
There are basically two extreme positions that 
one can adopt in answering this question. One 
is to attack a variety of applications first, e.g. 
translation, speech, dialogue etc., and hope that 
in so doing, enough general expertise can be ac- 
quired to build the basis of an NLP culture that 
is taken for granted with more computationally 
established languages. The other extreme is to 
attack the linguistic issues first, since, for what- 
ever reason, there is currently rather little in the 
way of an accepted linguistic framework from 
which to design computational materials. 
We have decided to adopt the middle ground 
by embarking upon the construction of a sub- 
stantial machine-tractable lexicon of the lan- 
guage, since whether we think in terms of appli- 
cations or linguistic theory, the lexicon is clearly 
a resource of fundamental importance. 
The construction of the lexicon involves two 
rather separate subtasks which may in practice 
become interleaved. The first is the identifica- 
tion of a set of lexical entries, i.e. entries that 
will serve as the carriers of information. The 
second is the population of the entries with in- 
formation of various kinds e.g. syntactic, se- 
mantic, phonological etc. 
Our initial task, trivial as it may sound, is to 
concentrate on the first of these subtasks, creat- 
ing what amounts to a word list, in a machine- 
readable and consistent format, for all the basic 
lexical entries of the language. The idea is that 
this will subsequently be used not only as a basis 
for applications (initially we will concentrate on 
spell-checking), but also as a tool for linguistic 
research on the language itself. 
2 The Maltese Language 
Maltese is the national language of Malta and, 
together with English, one of the two official 
languages of the Republic of Malta. Its use be- 
yond the shores of the Maltese islands is lim- 
ited to small emigrant communities in Canada 
and Australia, but within the geographical con- 
fines of Malta, the language is used for the 
widest possible range of types of interaction 
and communication, including education, jour- 
nalism, broadcasting, administration, business 
and literary discourse. 
Unsurprisingly in view of the disparate po- 
litical and cultural influences the islands have 
been exposed to over the centuries, Maltese 
is a so-called 'mixed' language, with a sub- 
strate of Arabic, a considerable superstrate of 
Romance origin (especially Sicilian) and, to a 
much more limited extent, English. The Semitic 
(Western/Maghrebi Arabic) element is evident 
enough to justify considering the language a pe- 
ripheral dialect of Arabic. Its script, codified as 
recently as the 1920s, utilises a modified Latin 
alphabet. This is just one of the peculiarities of 
Maltese as compared to other dialectal varieties 
of Arabic, more important ones being its status 
as a 'high' variety and its use in literary, formal 
and official discourse, its lack of reference to any 
Qur'anic Arabic ideal, as well as its handling of 
extensive borrowings from non-Semitic sources. 
These features make Maltese a very interesting 
area for those working in the fields of language 
contact and Arabic dialectology. 
2.1 The Maltese Alphabet 
As noted above, Maltese is the only dialect of 
Arabic with a Latin script. Maltese orthogra- 
phy was standardised in the 1920s, utilising an 
alphabet largely identical with the Latin one, 
with the following additions/modifications: 
Maltese   Pronunciation
ċ         chip (Eng)
ġ         jam (Eng)
h         silent
għ        mostly silent
ħ         hat (Eng)
ż         zip (Eng)
ie        ear (Eng) (approx)
2.2 Morphological Aspects of Maltese 
The morphology is still based on a root-and- 
pattern system typical of Semitic languages. 
For example, from the triliteral root consonants 
ħ - d - m one can obtain forms like:
ħadem work (verb);
ħaddiem worker;
ħidma work (noun);
nħadem be worked (verb passive);
ħaddem caused to work.
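The derivation of these forms from a single root can be sketched as follows. This is a minimal illustration rather than part of the Maltilex implementation: the digit-based template notation and the function name interdigitate are assumptions introduced here, with 1, 2 and 3 standing for the first, second and third root consonants.

```python
def interdigitate(root, template):
    """Fill a template with root consonants.

    Digits 1, 2 and 3 in the template stand for the first, second and
    third root consonant; every other character is copied unchanged.
    """
    return "".join(
        root[int(ch) - 1] if ch.isdigit() else ch
        for ch in template
    )


ROOT = ("ħ", "d", "m")

# Hypothetical template notation, chosen only for this sketch.
TEMPLATES = {
    "1a2e3": "work (verb)",
    "1a22ie3": "worker",
    "1i23a": "work (noun)",
    "n1a2e3": "be worked (verb passive)",
    "1a22e3": "caused to work",
}

FORMS = {interdigitate(ROOT, t): gloss for t, gloss in TEMPLATES.items()}
```

A doubled digit, as in 1a22ie3, encodes gemination of the middle consonant, so that the same root yields both ħadem and ħaddiem.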
Most of these forms are based on produc- 
tive templates (binyanim/forom/conjugations), 
of which Maltese has a subset of those in Clas- 
sical Arabic. One other typical feature shared 
with Semitic languages is broken plural forma- 
tion as opposed to so-called sound plural. A few 
examples are: 
qamar moon qmura moons; 
tifel/tifla boy/girl tfal children. 
Plural formation in such instances involves a 
change in CV pattern. Sound plural formation 
involves affixation of suffixes such as -i, very
common with words of Romance origin, -iet or
-a as in:
karozza car karozz-i cars;
ikla meal ikl-iet meals;
ħaddiem worker ħaddiem-a workers.
Maltese has taken on a very large number of 
Romance lexical items and incorporated them 
within the Semitic pattern. For example, pizza, 
a word of Romance origin, has the broken plu- 
ral form pizez (compare Italian pizza/pizze), 
and ċippa, a very recent borrowing from English
(computer chip) has a broken plural form ċipep.
In certain cases, one gets free variation between 
the broken plural form and a sound plural based 
on (Romance) affixation, e.g.: 
kaxxa box kaxex/kaxxi boxes
tapit carpet twapet/tapiti carpets.
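The two competing plural strategies for words such as kaxxa can be sketched in a few lines. The sketch assumes a simplified five-vowel inventory and covers only one broken plural shape (a stem of the form CVCCa with a doubled middle consonant, as in pizza or kaxxa); the function names are hypothetical, and which strategy a given word actually uses remains lexical information that the lexicon must record.

```python
import re

VOWELS = "aeiou"

# Shape C V C C a with a doubled (geminate) middle consonant, e.g. pizza, kaxxa.
GEMINATE = re.compile(rf"^([^{VOWELS}])([{VOWELS}])([^{VOWELS}])\3a$")


def broken_plural(word):
    """Map CVCCa to CVCeC for geminate stems; return None for other shapes."""
    match = GEMINATE.match(word)
    if match is None:
        return None
    c1, vowel, c2 = match.groups()
    return f"{c1}{vowel}{c2}e{c2}"


def sound_plural(word, suffix):
    """Drop a final vowel, if there is one, and attach the plural suffix."""
    stem = word[:-1] if word[-1] in VOWELS else word
    return stem + suffix


def plural_candidates(word, suffix):
    """Return every plural this sketch can generate for the word."""
    candidates = [broken_plural(word), sound_plural(word, suffix)]
    return [c for c in candidates if c is not None]
```

For kaxxa with the suffix -i, plural_candidates returns both kaxex and kaxxi, mirroring the free variation noted above; for karozza only the sound plural karozzi is produced, because the word does not fit the geminate shape.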
The stem, as opposed to the consonantal root, 
also plays an important role in word forma- 
tion, in particular in nominal inflection. Typi- 
cal stem-based plural forms in which the stem 
remains intact are: 
aħbar news item aħbar-ijiet news
omm mother omm-ijiet mothers
Verbs are also often borrowed and fully inte- 
grated into the Semitic verbal system and can 
take all of the inflective forms for person, num- 
ber, gender, tense etc. that any other Maltese 
verbs of Semitic origin can take. For example: 
spjega explain (It. spiegare) 
jispjega he explains 
nispjegaw we explain 
spjegat she explained 
spjegajt I explained, etc. 
ixxuttja kick a football (Eng. shoot)
jixxuttja he kicks
nixxuttjaw we kick
ixxuttjat she kicked
ixxuttjajt I kicked, etc.
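The regularity of these two paradigms can be made explicit with a small sketch. It generalises only from the forms listed above (stems ending in -a, four person/aspect combinations) and is not a full account of Maltese verbal morphology; the function name inflect and the feature labels are assumptions made for illustration.

```python
VOWELS = "aeiou"


def inflect(stem):
    """Return the four forms listed above for a loan verb stem ending in -a."""
    # The imperfective prefixes j- and n- take a linking i only when the
    # stem begins with a consonant (jispjega versus jixxuttja).
    link = "" if stem[0] in VOWELS else "i"
    return {
        "3sg.m imperfective": "j" + link + stem,
        "1pl imperfective": "n" + link + stem + "w",
        "3sg.f perfective": stem + "t",
        "1sg perfective": stem + "jt",
    }
```

Calling inflect("spjega") and inflect("ixxuttja") reproduces the two sets of forms given above, which suggests that inflected loan verbs need not be listed individually in the lexicon.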
The vigour and productivity of these pro- 
cesses is attested to by the fact that one keeps 
coming across new loan verbs all the time (in- 
creasingly more from English), both in spoken 
and in written Maltese, without the language 
having any difficulty in integrating them seam- 
lessly into its morphological setup. 
Within the verbal system complex inflectional 
forms can also be built through multiple affixa- 
tion. For example, the word 
bgħat - t - hie - lu - x
'I didn't send her to him', contains the suffixes
-t for 1st person singular subject (perfective),
-hie for 3rd person singular feminine direct object,
-lu for 3rd person singular masculine indirect
object, and -x for verb
negation. This ready potential for inflectional 
complexity is another Semitic feature of Mal- 
tese which applies across the board, whatever 
the origin of the verb. It also raises interest- 
ing questions concerning the nature of lexical 
entries, the relationship between lexical entries 
and surface strings, and the kind of morpholog- 
ical processing that is necessary to connect the 
two together. 
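One way to connect a surface string to a lexical entry is to peel affixes off in a fixed slot order. The sketch below does this for the example above only: the slot inventory contains just the four suffixes mentioned in the text, and the slot labels and the function name segment are assumptions, not a description of the project's morphological component.

```python
# Suffix slots in the order in which they attach to the stem. The inventory
# is deliberately limited to the suffixes discussed in the text.
SLOTS = [
    ("subject (perfective)", ["t"]),
    ("direct object", ["hie"]),
    ("indirect object", ["lu"]),
    ("negation", ["x"]),
]


def segment(word, stem):
    """Split a surface form into its stem plus suffixes, slot by slot."""
    if not word.startswith(stem):
        raise ValueError(f"{word!r} does not begin with stem {stem!r}")
    rest = word[len(stem):]
    parts = [("stem", stem)]
    for slot, suffixes in SLOTS:
        for suffix in suffixes:
            if rest.startswith(suffix):
                parts.append((slot, suffix))
                rest = rest[len(suffix):]
                break  # at most one suffix per slot
    if rest:
        raise ValueError(f"unanalysed remainder: {rest!r}")
    return parts
```

Every slot is optional, so the same routine analyses both the bare subject-marked form bgħatt and the fully affixed bgħatthielux against the single stem bgħat.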
Many of the linguistic issues that could help 
to resolve these questions are themselves unresolved
for lack of suitably organised language
resources (like the lexicon itself!). For this rea- 
son, we see the design/implementation of the 
lexicon, the development of language resources, 
and the evolution of linguistic theory for Mal- 
tese as three goals which must be pursued in 
parallel. 
At this very early stage of the project, we 
have sidestepped many of the finer issues by 
opting to codify the most uncontentious parts
of the lexicon first, as described below. At the 
same time, we are in the process of develop- 
ing an extensible text archive which will serve 
as the basis for empirical work concerning both 
the lexicon and the underlying linguistics. 
3 Constructing the Lexicon 
The two main resources available to construct 
the lexicon are dictionaries and text corpora. 
Both, in some sense, are representative of the 
lexical behaviour of words, and both have their 
advantages and disadvantages. 
3.1 The Dictionary Approach 
The basic idea underlying the dictionary ap- 
proach is this: if some lexicographer has al- 
ready gone to a great deal of trouble to compile 
a dictionary, why not make use of that work 
rather than repeat it? The appeal is obvious, 
and can be made to work, as is evidenced by, 
for example, the work of Boguraev and Briscoe 
(1987) who attempted to extract entries from 
the machine-readable version of Longman's dic- 
tionary. Problems of a practical nature soon 
arise, however, such as: 
• What to do if a machine readable version 
of the printed dictionary is not available, as 
is in fact the case with Maltese. 
• How to deal with the idiosyncratic formats
adopted by different lexicographers,
and how to handle the omissions and in- 
consistencies that are characteristic of all 
human oriented dictionaries. 
• Once the information is available, how to 
represent it. 
• How to deal with evolution of the language 
under investigation. Dictionaries always 
reflect the language as it was, not as it is. In 
the case of Maltese this problem is partic- 
ularly acute, given that the most obviously 
useful dictionary contains a large number 
of entries that are regarded by many as ar- 
chaic. 
Many of these problems, except the last, are 
alleviated by adopting an essentially manual ap- 
proach in the early stages. We have adopted 
the most complete and detailed dictionary cur- 
rently available by J. Aquilina (Aquilina, 1987) 
and are in the process of transcribing the so- 
called major entries into our own format by 
means of a form interface as illustrated in Figure
1. Major entries of this dictionary comprise
a specific, orthographically distinguished (capi- 
talised) subset containing the basic lexical forms 
of the language. They thus form a reasonable 
starting point for our purposes. The other (non- 
capitalised) entries are derived lexical forms of 
various kinds. 
For the present, we are simply ignoring in- 
flectional forms, since ultimately it is more ef- 
ficient to assume that they can be systemati- 
cally related to the basic entries by a morpho- 
logical transformation of the sort implemented 
by Galea (1996).
The most important information is the headword,
a sequence of characters used to identify a particular
lexical primitive or lexeme. Most of
the time, the headword and the lexeme are in
one-to-one correspondence, but there are exceptions.
Distinct lexemes (and therefore entries)
with the same headword are homonyms (e.g. 
tikk, a clock tick and tikk, a facial spasm). 
Single lexemes can also manifest polysemy, different
meanings under the same headword (e.g.
tikka, a point-like diacritic mark and tikka, a
very small amount).

Figure 1: Internet Form for Dictionary Entries. The form, titled
"Maltilex Lexical Entry", has fields for the headword, variants,
verb, noun, diminutive, gender, plurals (including collective),
verbal noun and noun agent.

These variations are accommodated using the 
headword (string), homonym (integer) and pol- 
yseme (integer) fields in the form, the inte- 
gers deriving from the ordering implicit in the 
printed dictionary. 
The second line of the form contains root 
(typically 3 consonants) and stem information 
for words of Semitic and non-Semitic origin respectively,
whilst the third contains variants
(e.g. farfett/ferfett, butterfly). 
The remainder of the form contains mostly 
grammatical information, including that on 
(various forms of) plural. There is also space 
for comments from the individual lexicographer. 
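The fields described above map naturally onto a simple record type. The following is a hedged sketch only: the class name LexicalEntry, the field types and the example numbering are assumptions made for illustration, and the actual storage format used by the project is not specified here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class LexicalEntry:
    headword: str                           # string identifying the lexeme
    homonym: int = 1                        # index among entries sharing the headword
    polyseme: int = 1                       # index among senses of one lexeme
    root: Optional[Tuple[str, ...]] = None  # consonantal root (Semitic origin)
    stem: Optional[str] = None              # stem (non-Semitic origin)
    variants: List[str] = field(default_factory=list)
    plurals: List[str] = field(default_factory=list)
    comments: str = ""

    def key(self):
        """Identify an entry by headword, homonym and polyseme numbers."""
        return (self.headword, self.homonym, self.polyseme)


# Illustrative numbering only; in the project the integers follow the
# ordering of the printed dictionary.
entries = [
    LexicalEntry("tikk", homonym=1, comments="a clock tick"),
    LexicalEntry("tikk", homonym=2, comments="a facial spasm"),
    LexicalEntry("tikka", polyseme=1, comments="a point-like diacritic mark"),
    LexicalEntry("tikka", polyseme=2, comments="a very small amount"),
    LexicalEntry("farfett", variants=["ferfett"], comments="butterfly"),
]
index = {entry.key(): entry for entry in entries}
```

Keying on the triple keeps homonyms and polysemes apart while still allowing all entries for a given headword to be retrieved by filtering on the first component.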
The end product of the work described in this 
section is essentially a list of lexical entries for 
what we are calling the uncontentious parts of 
the language. The content of entries is essen- 
tially by reference (to the entries of Aquilina's 
dictionary) rather than literal.
3.2 The Corpus Approach 
Comparatively recent technological changes 
have made it possible, in principle, to create 
and maintain corpora that are sufficiently large 
and accessible to be suitable for the purposes of 
lexical acquisition. One of the greatest advantages
of the corpus approach to lexical acquisition,
compared to the dictionary approach just
described, is that in principle such corpora come 
as close as it is possible to get to a truly current 
snapshot of the language, particularly if they 
are continuously updated. Other arguments in 
favour of using texts as the basis for lexical ac- 
quisition are advanced in the editor's introduc- 
tion to Boguraev and Pustejovsky (1995). 
To adopt the corpus approach it is of course 
necessary to have a corpus, so that a priority 
task is the construction of a machine-readable, 
evolving record of the current written language. 
All the main Maltese language newspapers have 
been approached, and some journalistic texts 
(various fields) have already been obtained. We 
have recently managed to obtain speech corpora 
with parallel text of national radio news broad- 
casts. Furthermore, practical arrangements are 
currently being made for the provision of such 
materials on a regular and frequent basis. Book 
publishers have agreed to make titles from their 
respective ranges available for inclusion in the 
corpus. As it stands, the raw collection includes 
a number of book excerpts from various titles. 
One feature of this approach is the constantly 
evolving relationship between corpus and lex- 
icon: the corpus enriches the lexicon, but as 
the latter evolves, it can be used to add fur- 
ther information to the corpus in the form of 
annotations or tags, thus expanding its scope. 
A corpus annotated with part-of-speech tags, 
for example, can be used to infer a statistical 
model that can be harnessed to efficiently and 
automatically assign tags to previously unseen 
texts. 
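The simplest statistical model of this kind is a unigram tagger, which assigns each word the tag it received most often in the annotated corpus. The sketch below illustrates the idea on a toy corpus; the tagset (NOUN, VERB, UNK), the training sentences and the function names are all hypothetical, and a practical tagger would also use context.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Build a word -> most frequent tag table from annotated sentences."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag(model, words, default="UNK"):
    """Tag unseen text, falling back to a default tag for unknown words."""
    return [(word, model.get(word.lower(), default)) for word in words]


# Toy annotated corpus, invented for this sketch.
CORPUS = [
    [("ħaddiem", "NOUN"), ("ħadem", "VERB")],
    [("tifel", "NOUN"), ("jispjega", "VERB")],
]
MODEL = train(CORPUS)
```

Words absent from the model are returned with the default tag, which also gives a cheap way of listing candidate new entries for the lexicon.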
3.3 Character Representation 
In the course of collecting corpus texts, it soon 
became apparent that, as a result of lack of 
standardisation early on in the introduction and 
spread of IT in Malta, a certain amount of anar- 
chy reigns, with various computer/printer sup- 
pliers having developed and disseminated 'Mal- 
tese' adaptations of existing fontsets. The fact 
that they proceeded independently of each other 
and with no external regulation meant that the 
same Maltese-specific characters were assigned 
different ANSI codes in Windows (TTF) fonts 
supplied by competing sellers, making it diffi- 
cult to read documents not only across plat- 
forms but also within the same platform. 
A persistent challenge to the computational 
treatment of Maltese is therefore the question 
of text representation, i.e. the numerical coding 
for the characters that make up words. The 
requirements are: 
• That the coding should follow an interna- 
tionally recognised standard. 
• That there exist appropriate fonts for 
use on the screen and on the printer 
across a variety of hardware platforms 
(PC/Mac/Unix). 
• That there exists an accepted keyboard 
configuration to generate the codes. 
Although no code satisfying all of these 
requirements exists, the most acceptable 
workaround available at present is to adopt 
fonts conforming to ISO 8859-3, known as Latin
Alphabet No. 3. Two PC-compatible fonts con- 
forming to this standard are known as "Tor- 
nado" and "FTIMAL" and we are currently in- 
vestigating the copyright status of each of these. 
Given that these fonts are closely tied to the
PC (rather than to Unix or Macintosh platforms),
and given the rather casual attitude taken to the
adoption of text representation standards lo- 
cally, we have defined a project-internal Stan- 
dard Maltese Text Representation (SMTR) for 
storing text archives in a way that is (a) human- 
readable (and human-editable), (b) compatible 
with Unix systems and (c) easily translatable 
to and from any other coding format by means 
of simple finite-state methods (we are using Xe- 
rox's xfst for this purpose). 
Maltese   ASCII
ċ         _c
għ        _y
ħ         _h
ż         _z
ie        _i
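The translation between a standard encoding and an ASCII-only representation can be illustrated as follows. This is a plain Python stand-in for the finite-state approach, not the project's xfst scripts, and the mapping used (underscore plus base letter, lower case only) is an assumption made for this sketch rather than the definitive SMTR table.

```python
# Assumed, simplified mapping: each Maltese-specific letter becomes an
# underscore followed by its base letter. Upper-case letters are omitted.
TO_ASCII = {"ċ": "_c", "ġ": "_g", "ħ": "_h", "ż": "_z"}
FROM_ASCII = {code: char for char, code in TO_ASCII.items()}


def to_smtr(text):
    """Replace Maltese-specific characters with their ASCII codes."""
    return "".join(TO_ASCII.get(ch, ch) for ch in text)


def from_smtr(text):
    """Reverse the mapping by scanning for two-character codes."""
    out, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in FROM_ASCII:
            out.append(FROM_ASCII[pair])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def latin3_to_smtr(data):
    """Decode ISO 8859-3 bytes and convert the result to the ASCII form."""
    return to_smtr(data.decode("iso8859_3"))
```

The round trip is lossless as long as the input contains no literal underscores; a fuller scheme would need an escape convention for that case.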
4 Conclusion 
This paper has attempted to convey our ap- 
proach to the problem of rendering Maltese 
amenable to current language engineering tech- 
niques via the construction of a computational 
lexicon. One difficulty that we are currently fac- 
ing is a shortage of appropriately qualified per- 
sonnel to work on the project, though hopefully 
this problem will be alleviated by the appear- 
ance of our first CS/Computational Linguistics 
graduates during the coming year. Three sub- 
projects are currently in the pipeline with the 
following themes: 
• Finite State Methods. Development of 
finite state transducers for extracting lexi- 
cal information from text corpora. 
• Computational Grammar. Development
of a grammar and parsing system for
Maltese sentences. This will probably be 
based on HPSG. 
• Computational Morphology of Plural 
Forms. 
5 Acknowledgements 
The authors gratefully acknowledge the contribution
made by the Mid-Med Computer and
Commerces Foundation to the funding of this 
project. 

References 
J. Aquilina. 1987. Maltese-English Dictionary. 
Midsea Books. 
B. Boguraev and T. Briscoe. 1987. Large lexicons
for natural language processing: exploring
the grammar coding system of LDOCE.
Computational Linguistics, 13:203-218.
B. Boguraev and J. Pustejovsky. 1995. Corpus
Processing for Lexical Acquisition. MIT
Press, Cambridge, MA.
D. Galea. 1996. Morphological analysis of Maltese
verbs. Technical Report B.Sc. Dissertation,
Department of Computer Science,
University of Malta.
