Manipulating human-oriented dictionaries with very simple tools 
Jean Gaschler & Mathieu Lafourcade
(Jean.Gaschler@imag.fr & Mathieu.Lafourcade@imag.fr)
GETA, IMAG-campus (UJF & CNRS) 
BP 53, F-38041 GRENOBLE Cedex 09
Abstract 
It is possible to manipulate real-size human-oriented
dictionaries on a Macintosh by using only very simple
tools. Our methodology has been applied in the
construction of a French-English-Malay dictionary. This
dictionary has been obtained by semi-automatically
"crossing" two bilingual dictionaries. To revise
the dictionary, as well as to obtain a publishable paper
form and an on-line electronic form, we use only
Microsoft Word™, a specialized language for writing
transcriptors and a small but powerful dictionary tool.
Keywords
Linguistic tools, Transducers, Dictionary management,
Human-oriented dictionaries.
Introduction 
In collaboration with University Sains Malaysia (USM),
we are working on a French-English-Malay
human-oriented dictionary (FEM project) obtained by
"crossing" French-English and English-Malay
dictionaries.
Taking into account the reluctance of lexicographers to
revise dictionaries through database interfaces
(dBASE III™ or 4D™), we have developed a methodology
based on using only very simple tools. For editing, we use
Word and its styling facility, because no editor of
structured documents is available on the Macintosh. For
importing and exporting, we use LT, a simple specialized
language for writing transcriptors, to transform between
representations (normalized ASCII, RTF, etc.). Finally,
we have developed ALEX, a dictionary tool, to support
the electronic form. The methods defined have been
applied to the FEM dictionary. They concern the
correction of errors which can appear in a manually built
dictionary and the formatting of this dictionary.
We first describe in more detail the situation we face
with the FEM project. We then set out the goals we aim to
reach. Finally, we give our generic methods and their
application to the specific case of the FEM dictionary.
I. Situation 
1. Presentation 
The FEM dictionary is composed of two parts: a general
one (12,000 entries) and a specific one for
computer-science terminology (2,300 terms). Both paper
and electronic forms will be produced by mid-94.
We initially received ASCII files, first obtained by
optical character recognition and then corrected manually,
in which the information of the French-English and
English-Malay dictionaries had been crossed.
2. Logical form 
The printed form of a dictionary reflects its internal
structure (Boguraev 1990, Byrd, Calzolari, Chodorow &
al. 1987). This structure can be modelled with a logical
form which gives the sequence of the information
contained in the dictionary. This logical form contains
entries, pronunciation parts, spelling variants,
grammatical categories, semantic information,
sub-entries, etc. We have defined a logical form of the
article in the FEM dictionary (fig. 1).
Main part of an entry:
  (Masculine by default)
  Pronunciation
  Entry variant
  Pronunciation variant
  Plural form
  Plural form pronunciation
  Masculine plural form
  Masculine plural form pronunciation
  Masc. plural form pronunciation variant
  Feminine form
  Feminine form pronunciation
  Feminine plural form
  Feminine plural form pronunciation
  Plural form variant
  Plural form variant pronunciation
  French grammar category
  Gloss (in French)
  (semantic categories)
  Label English equivalent
  English equivalent
  Label Malay equivalent
  Malay equivalent
Illustrative phrases:
  French phrase (such as compound words)
  English phrase equivalent
  Malay phrase equivalent
Sub-entries (with the same structure as an entry):
  Sub-entry (in French)
Cross references:
  Cross-reference marker
  Cross-reference entry
Fig. 1: logical form of an entry in the FEM dictionary
(the lines in italic are optional in an entry)
3. ASCII normalized external form 
A label is linked with every type of information of the
logical form and is included in the initial ASCII files.
Thus, USM has obtained basic entries such as the one given
in fig. 2, which corresponds to the French entry "accident"
(the label 'e' corresponds to the French entry, 'pcn' to
pronunciation, 'c' to grammatical category, etc.).
e,accident 
pcn,/aksid'/ 
c,n.m. 
ee,accident 
me,kemalangan 
p,accident de train/d'avion
epe,train/plane crash
mpe,kemalangan keretapi
mpe,nahas kapal terbang
sbe,accidenté
pcn,/aksid~te/
c,a.
ee,damaged (in an accident)
me,rosak (dalam kemalangan)
ee,hurt (in an accident)
me,tercedera (dalam kemalangan)
g,(terrain) 
ee,uneven 
me,tidak rata (kawasan, daerah) 
ee,hilly 
me,berbukit 
Fig. 2: a labelled entry
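As an illustration only, the labelled form of fig. 2 can be read with a few lines of Python (the project itself uses LT, not Python). The sketch assumes that each line is a label, a comma and a value, and that the labels 'e' and 'sbe' open an entry and a sub-entry respectively; the function names are hypothetical.

```python
from typing import List, Tuple

Field = Tuple[str, str]


def parse_labelled_lines(text: str) -> List[Field]:
    """Split each 'label,value' line at the FIRST comma only,
    so that commas inside the value are preserved."""
    fields = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        label, sep, value = line.partition(",")
        if not sep:
            raise ValueError(f"line without a label: {line!r}")
        fields.append((label.strip(), value.strip()))
    return fields


def split_sub_entries(fields: List[Field]) -> List[List[Field]]:
    """Group fields into blocks; a new block starts at each 'e' or 'sbe'
    label (assumption: these labels open an entry or a sub-entry)."""
    blocks: List[List[Field]] = []
    for label, value in fields:
        if label in ("e", "sbe") or not blocks:
            blocks.append([])
        blocks[-1].append((label, value))
    return blocks


SAMPLE = """e,accident
c,n.m.
ee,accident
me,kemalangan
sbe,accidenté
c,a.
ee,damaged (in an accident)
me,rosak (dalam kemalangan)
"""

fields = parse_labelled_lines(SAMPLE)
blocks = split_sub_entries(fields)
```

Splitting at the first comma only matters for values such as "tidak rata (kawasan, daerah)", which contain commas of their own.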
II. Goals 
We pursue four goals in this project. They are listed
below in order of importance.
1. Paper formatting 
Our first aim is to produce, from the ASCII normalized
form, a paper form of the FEM dictionary with a format
approaching that of usual dictionaries (fig. 3). This
involves introducing fonts, styles, etc. into the format.
accident /aksidɑ̃/ n.m. accident : kemalangan,
(kejadian) tidak sengaja, (kejadian) secara
kebetulan -- accident de train/d'avion train/plane
crash : kemalangan keretapi, nahas kapal terbang
-- accidenté /aksidɑ̃te/ a. damaged (in an accident) :
rosak (dalam kemalangan) hurt (in an accident) :
tercedera (dalam kemalangan) (terrain) uneven :
tidak rata (kawasan, daerah) hilly : berbukit.
Fig. 3: an entry of the publishable paper FEM dictionary
2. Electronic formatting 
We also produce an electronic form. This electronic
dictionary is supported by a generic multilingual
dictionary tool, ALEX. The problem is to keep as much as
possible of the logical form, so as to allow logical access
such as searching on multiple keys, sorting, etc.
3. Dictionary revision
The "crossing" of the French-English and English-Malay
dictionaries was done manually by people who were
not fluent in French. Thus, some errors remain both in the
logical structure and in the content. These errors have to
be corrected before producing the final paper form of the
FEM dictionary.
4. Phonetic codes conversion 
USM did not use the standard phonetic transcription
(International Phonetic Alphabet, IPA), but a local
transcription using certain characters of the Times™ font
which look like characters of the IPA. These characters
have high ASCII codes (128 to 255), so their rendering
differs from one font to another. To be portable to the PC,
for instance, the files must contain only lower ASCII
characters (32 to 127).
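The portability constraint can be checked mechanically. The following Python sketch is an illustration, not part of the tools described here; it reports every character that falls outside the printable 7-bit range, taken here as codes 32 to 126 (an assumption about the exact bounds), and the function name is hypothetical.

```python
def find_non_portable(text: str):
    """Return (line, column, character) for every character whose code
    is outside the printable 7-bit ASCII range 32..126.
    Line breaks are used only to split the text and are not reported."""
    problems = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if not (32 <= ord(ch) <= 126):
                problems.append((line_no, col, ch))
    return problems


# A file is safe to exchange when the list is empty.
report = find_non_portable("e,accident\nsbe,accidenté")
```

Running such a check before and after the phonetic conversion gives a quick way to confirm that no font-dependent character is left in the files.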
III. Methodology
Our methodology is generic enough to be applied to other
projects dealing with the construction of real-size
publishable human-oriented dictionaries. The
methodology is based on the use of simple but powerful
tools.
1. Use of an editor for correcting errors 
The problem is to find appropriate software for this
work. The first type of software is databases, but our
experience with them (we have used dBASE III and 4D)
shows that lexicographers don't like to work through
DataBase Management Systems. They want to use the
same word processor to see the texts they want to index
and to construct the dictionary.
The most practical tool would be an editor of structured
documents like Grif (André, Furuta & Quint 1989, Phan
& Boitet 1992), which can manage the logical form of the
dictionary. However, such editors are complex to learn
and are not yet available on micros, as they require large
computing resources. Hence, we use Word, a widely
available commercial word processor.
We approach this notion of structured documents by using
Word's "styling" facility. A Word style is a named group of
paragraph and character formats (e.g. the title
of this section has the style 'Title1', which includes the
information about the rendering of this title). We associate
a particular style with each logical type of information in the
dictionary.
accident
/aksidɑ̃/
n.m.
accident
kemalangan
(kejadian) tidak sengaja
(kejadian) secara kebetulan
accident de train/d'avion
train/plane crash
kemalangan keretapi
nahas kapal terbang
accidenté
/aksidɑ̃te/
a.
damaged (in an accident)
rosak (dalam kemalangan)
hurt (in an accident)
tercedera (dalam kemalangan)
(terrain)
uneven
tidak rata (kawasan, daerah)
hilly
berbukit
Fig. 4: an entry with styles
2. Use of an SLLP for converting formats 
To convert the initial normalized ASCII external form
(fig. 2) into a printable form (fig. 3), we considered
several solutions:
- The first solution is to use Word's macro facility.
  Unfortunately, that facility is only available on the
  PC version, and we found it very clumsy to
  constantly exchange large files between the PC and
  the Macintosh, not to mention unexpected character
  transformations in the phonetic font.
- The second solution is to use transducers, but the
  commercial transcriptors available are only based on
  direct correspondences. They cannot take into
  account a forward context and they generally have
  no variables (or notion of state). Thus, they are not
  powerful enough for the problems at hand.
We used LT (Language of Transcriptions), a Specialized
Language for Linguistic Programming (SLLP) for writing
transcriptors.
LT transducers have one input tape with two reading
heads (one standard head and one forward head) and one
writing head. They can also handle variables and produce
side effects. Thus, this kind of transcriptor is not
reversible in general.
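To make the role of the two reading heads concrete, here is a minimal Python sketch of a single-state transducer in which each rule may inspect a forward context without consuming it. It is not LT and does not reproduce LT's syntax or semantics: the rule format (read, lookahead, write), the function name and the toy rules are all assumptions, and variables, side effects and multiple states are left out.

```python
def transcribe(text, rules):
    """Apply (read, lookahead, write) rules from left to right.

    `read` is consumed by the standard head. `lookahead` (possibly "")
    must follow it on the tape, but is only inspected by the forward
    head and stays available for the next step. The first matching rule
    wins; a character that matches no rule is copied unchanged.
    """
    for read, _, _ in rules:
        if not read:
            raise ValueError("a rule must consume at least one character")
    out = []
    pos = 0
    while pos < len(text):
        for read, lookahead, write in rules:
            end = pos + len(read)
            if text.startswith(read, pos) and text.startswith(lookahead, end):
                out.append(write)
                pos = end          # only the `read` part is consumed
                break
        else:
            out.append(text[pos])  # default: copy the character
            pos += 1
    return "".join(out)


# Toy rules, purely illustrative:
TOY_RULES = [
    ("n", "n", ""),   # drop an "n" that is followed by another "n"
    ("a", "", "A"),   # rewrite "a" whatever follows
]

result = transcribe("banna", TOY_RULES)
```

The first toy rule shows why a forward context matters: the decision to drop an "n" depends on a character that is still to be processed by the standard head. It also shows why such a transcriptor is not reversible in general, since the output no longer says where an "n" was dropped.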
There have been previous versions of LT (Lepage 1986).
The LT used in our work has been implemented on the
Macintosh with CLOS (Common Lisp Object System)
(Lafourcade 1993). The implementation is based mainly on
Lisp macro-programming on top of an Automaton
Manager.
[LT script editor window: "Transcriptor from lower case to upper case"]
transcriptor lower->upper
initial state is init
from init to init by
read "a" then write "A"
read "b" then write "B"
read "c" then write "C"
read "d" then write "D"
read "z" then write "Z"
read 1 character then write it
Fig. 5: an LT transcriptor example
With LT, we have easily written all necessary converters.
Phonetic transcriptions 
These conversions first concern the problem of special
characters used in some fonts, especially the characters
used at USM (standard Macintosh fonts, i.e. Courier or
Times) to approximate the International Phonetic Alphabet
(IPA). For example, the ' sign (as in /aksid'/) appears only
in a standard Macintosh font.
We have thus defined three formats. Ph1 is the initial form
of the Word files in a standard Macintosh font (files built
at USM). Ph2 is the format where special characters are
replaced by others which appear in all usual fonts
(characters corresponding to the letters, the digits
and the '+' and '-' signs, i.e. 7-bit ASCII). This ASCII
coding allows safe exchange between Macintoshes
and PCs.
aborigène: /ab>ri3)-'n/ --> /ABORIJE+N/
To transcribe from the Ph1 to the Ph2 format, we use the
LT transcriptor Ph1toPh2. An excerpt is given below.
transcriptor Ph1toPh2
initial state is init
from init to init via
read "~" then write "E+"
read ">" then write "O"
read "3" then write "J"
read "e" then write "E-"
read "it" then write "A-"
Fig. 6: excerpt of the LT transcriptor Ph1toPh2
Ph3 is the IPA phonetic format. PalPhon is the font used
for this IPA transcription. The problem is to assign this
font only to the lines which correspond to a phonetic
transcription, and hence to identify those lines.
Here, we work on the RTF (Rich Text Format) form,
directly produced by Word, which records all the
information describing Word documents (styles, fonts
and other information such as italic, bold, etc.).
Figure 7 presents the RTF form for the line 'e,accident'.
\s0 corresponds to the 'Normal' style of Word and \f20
corresponds to the standard Macintosh font.
\pard \s0 \f20 e,accident \par
  \pard        beginning of paragraph
  \s0          style number
  \f20         other information
  e,accident   text of the paragraph
  \par         end of paragraph
Fig. 7: description of the RTF
Then, Ph2toPh3 performs the transition from Ph2 to Ph3. It
transforms the RTF form of the lines corresponding to
pronunciation (phonetic) by converting the Times font
code (\f20) to the PalPhon font code (\f1138) and each
character in Ph2 form to the IPA form.
aborigène: /ABORIJE+N/ --> /abɔriʒɛn/
An excerpt of Ph2toPh3 is given below. The RTF code
\'ab corresponds to the character 'ɛ' in the PalPhon font,
\'bf to 'ɔ', etc.
transcriptor Ph2toPh3
initial state is init
from init to init via
read "E+" then write "{\f1138 \'ab}"
read "O" then write "{\f1138 \'bf}"
read "J" then write "{\f1138 \'bd}"
read "E-" then write "{\f1138 e}"
read "A-" then write "{\f1138 \'8c}"
Fig. 8: excerpt of the LT transcriptor Ph2toPh3
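For readers unfamiliar with LT, the behaviour of such a transcriptor can be approximated in Python by a table-driven, longest-match-first rewriting. This is only a sketch under stated assumptions: the table holds just the five correspondences of the excerpt (with the font code read as \f1138), unlisted characters are copied unchanged, and the names PH2_TO_RTF and ph2_to_ph3 are hypothetical.

```python
# Ph2 code -> RTF group selecting the phonetic font and one character.
# Only the five correspondences of the excerpt; the full table is longer.
PH2_TO_RTF = {
    "E+": r"{\f1138 \'ab}",
    "O": r"{\f1138 \'bf}",
    "J": r"{\f1138 \'bd}",
    "E-": r"{\f1138 e}",
    "A-": r"{\f1138 \'8c}",
}


def ph2_to_ph3(ph2: str, table=PH2_TO_RTF) -> str:
    """Rewrite a Ph2 string, trying the longest codes first so that a
    two-character code is never split into shorter ones."""
    keys = sorted(table, key=len, reverse=True)
    out = []
    pos = 0
    while pos < len(ph2):
        for key in keys:
            if ph2.startswith(key, pos):
                out.append(table[key])
                pos += len(key)
                break
        else:
            out.append(ph2[pos])  # assumption: unlisted characters pass through
            pos += 1
    return "".join(out)


converted = ph2_to_ph3("OJ")
```

Trying two-character codes before one-character ones is what keeps the coding unambiguous if a bare letter and a letter followed by '+' or '-' are both present in the table.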
External format 
The first conversion type concerned the rendering of
special characters in the dictionary. The second
concerns the external format of the dictionary. We have
defined three formats:
- AN: the ASCII normalized form, which corresponds
  to the initial files (Ph1), these files with phonetic
  encoding (Ph2) and these files in the RTF format
  (Ph3)
- WT: the Word transitory form, which corresponds to
  the stylized files with phonetic encoding (Ph2)
  and these files in the RTF format (Ph3) (fig. 4)
- WP: the Word printing form (fig. 3), in which we have
  cancelled all information about styles but kept
  the other information such as font codes and other
  character formats (Ph3)
The conversions between the AN, WT and WP forms are made
with LT transcriptors.
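The general idea of going from labelled fields to a printable line can be sketched in Python as follows. This is a simplified, text-only illustration and not the LT transcriptors actually used: it ignores fonts, styles and RTF, and the separators (" : " before Malay equivalents, ", " between successive Malay equivalents, " -- " before phrases and sub-entries) are assumptions read off fig. 3.

```python
MALAY = {"me", "mpe"}        # Malay equivalents (word and phrase level)
NEW_SEGMENT = {"p", "sbe"}   # phrases and sub-entries open a new segment


def render_entry(fields):
    """Join (label, value) pairs into one printable line."""
    out = ""
    prev = None
    for label, value in fields:
        if prev is None:
            out = value
        elif label in MALAY and prev in MALAY:
            out += ", " + value      # several Malay equivalents in a row
        elif label in MALAY:
            out += " : " + value     # Malay follows the English equivalent
        elif label in NEW_SEGMENT:
            out += " -- " + value
        else:
            out += " " + value
        prev = label
    return out + "."


ENTRY = [
    ("e", "accident"),
    ("c", "n.m."),
    ("ee", "accident"),
    ("me", "kemalangan"),
    ("p", "accident de train/d'avion"),
    ("epe", "train/plane crash"),
    ("mpe", "kemalangan keretapi"),
    ("mpe", "nahas kapal terbang"),
]

line = render_entry(ENTRY)
```

Because the decision for each field depends only on its label and on the label of the previous field, the same logic fits naturally into a transducer with a small amount of state.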
3. Use of a dictionary tool
ALEX is a simple and easy to use generic dictionary tool.
Its functionalities are quite classical (inserting and
deleting items, sorting, searching). The interesting
features are the possibility to index a base on several keys
and to search according to these keys or the content of any
non-indexed field (although it is slower).
Entries can be structured objects and searches can be done
according to the values of their features. A single base can
handle heterogeneous objects.
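The difference between searching on an indexed key and on a non-indexed field can be illustrated with a small Python sketch. It does not describe how ALEX is implemented; the class name, the method names and the representation of entries as Python dictionaries are all assumptions made for the example, and deletion and sorting are left out.

```python
from collections import defaultdict


class DictionaryBase:
    """Toy base indexed on several keys."""

    def __init__(self, indexed_keys):
        self.indexed_keys = tuple(indexed_keys)
        self.entries = []
        # One index per key: value -> positions of the matching entries.
        self.indexes = {key: defaultdict(list) for key in self.indexed_keys}

    def insert(self, entry):
        position = len(self.entries)
        self.entries.append(entry)
        for key in self.indexed_keys:
            if key in entry:  # heterogeneous entries may lack some keys
                self.indexes[key][entry[key]].append(position)

    def search(self, key, value):
        if key in self.indexes:
            # Indexed key: direct lookup.
            return [self.entries[i] for i in self.indexes[key].get(value, [])]
        # Non-indexed field: scan every entry (slower).
        return [e for e in self.entries if e.get(key) == value]


base = DictionaryBase(["french", "english"])
base.insert({"french": "accident", "english": "accident", "malay": "kemalangan"})
base.insert({"french": "accidenté", "english": "damaged", "malay": "rosak"})
```

Here a search on "english" uses an index, while a search on "malay" falls back to a scan of the whole base, which is the trade-off mentioned above.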
Fig. 9: an example of the electronic form
(terminology entry "attribut par défaut")
It is possible to pilot ALEX remotely (instead of
interacting with it via the user interface) and this method
has been used to fill the FEM electronic base.
To do so, we have written an LT transcriptor with strong
side effects on ALEX. The goal, here, was not to produce
a result in terms of a transcribed file, but instead to read a
file and produce actions on the ALEX base. As any dialect
of LT can mix Lisp commands into its scripts, it was
possible to make these tools cooperate.
Conclusion 
The methodology for manipulating human-oriented
dictionaries presented in this paper is based on simple but
powerful tools which can be used by lexicographers who
don't want to spend much time learning how to use
structured document editors and, even less, how to
program in a DBMS. We use Word, a commercial word
processor; LT, a language of transcriptions; ALEX, a
dictionary tool. Contrary to our initial fears, these simple
tools proved very convenient, and powerful enough for
the tasks at hand.
LT and ALEX will soon be available by anonymous ftp at
cambridge.apple.com.
Acknowledgements 
We are grateful to Chuah C. Kim and Zarin Y. from the
University Sains Malaysia for their patience and their
corrections of the Malay part of the FEM dictionary. We
wish to thank C. Boitet, H. Blanchon, G. Sérasset and
other colleagues from GETA for their support, their help
and their remarks. All remaining deficiencies are, of
course, ours.
References 
André, J., R. Furuta and V. Quint (1989) Structured
Documents. Cambridge University Press, Cambridge,
220 p.
Boguraev, B. (1990) Data Models for Lexicon
Acquisition. Proc. International Workshop on
Electronic Dictionaries, November 8-9, 1990, vol.
1/1, pp. 70-86.
Byrd, R. J., N. Calzolari, M. S. Chodorow, J. L. Klavans,
M. S. Neff and O. A. Rizk (1987) Tools and Method
for Computational Linguistics. Computational
Linguistics, 13/3-4, pp. 219-240.
Lafourcade, M. (1993) Inside LT. GETA, IMAG,
Technical Report GETA, September 93, 47 p.
Lepage, Y. (1986) A language for transcriptions. Proc.
COLING-86, Bonn, IKS, pp. 402-404.
Phan, H. K. and C. Boitet (1992) Multilinguization of an
editor for structured documents. Application to a
trilingual dictionary. Proc. COLING-92, Nantes,
July 1992, C. Boitet, ed., ACL, vol. 3/4, pp. 966-
971.
