LDVLIB(LEM): A SYSTEM FOR INTERACTIVE LEMMATIZING AND ITS
APPLICATION
R. Drewek, M. Erni 
Seminar of Romance Languages, University of Zurich,
Switzerland
A concrete project like our "Concordanza lemmatizzata
delle "Operette morali" di G. Leopardi" (a lemmatized
concordance of an Italian text of the 18th century with some
archaic phenomena, comprising about 70,000 tokens and 9,500
types) is a good opportunity to introduce a new software
package for linguistic data processing, not as a mere
accumulation of routines or statements but as a comfortable
tool already in use.
LDVLIB is not an experimental, fragile collection of
algorithms dedicated to a single language. It tries to provide
fast and reliable standard procedures for everyday jobs in
linguistic and literary research, and sometimes even a bit
more. The package consists of 34 programs and 41 modules,
mainly written in PL/1. They have been carefully developed
over the last seven years and tested in various research
projects since then. The programs can be grouped by purpose:
- text preparation (editing, correcting and printing)
- text corpus handling 
- lexical text analysis, lexicostatistics 
- statistical string description (length phenomena) 
- machine dictionary management 
- production of indices, frequency dictionaries and 
concordances 
- lemmatization
- analysis of spoken language texts 
- content analysis
- utilities for bibliographies, document preparation,
graphics and graphemes
Whereas the programs can be used by the non-programming
researcher, who communicates with the program through a
keyword-oriented and largely unformatted command language, a
set of modules is intended to support the programming linguist
in the fields of string manipulation, word and word list
manipulation, dictionary handling, VDU full-screen
communication, printing, plotting and other purposes.
All programs which produce numerical output from
statistical analysis provide a data interface to well-known
statistical software such as SPSS or SAS. The text coding
rules follow the printed original, with a few restrictions
which can easily be learned even by untrained personnel. The
character set can accommodate any roman transliteration of
languages using different graphemes; even Old Egyptian
hieroglyphic texts have been analyzed by LDVLIB programs.
The complex task of producing a concordance calls on many
of the facilities provided by LDVLIB programs. The "crucial
point" of lemmatization must be discussed in order to define
an appropriate interface for man-machine interaction that
yields reasonable philological results. Our design of an
interactive lemmatizer may be useful to show not only
man-machine interaction but computational linguist/literary
expert interaction as well. And it might reveal the lack of
linguistically reliable algorithms for a fully automatic
approach to this problem.
LDVLIB(LEM) doesn't lemmatize automatically, but it
supports lemmatization as follows. It allows work on single
portions of a text, and one or more users can access the
on-line machine dictionary at the same time. The user is
presented on the screen with:
- in the upper part, from the KWIC concordance:
every token to be lemmatized, with context and references
(page, line)
- in the lower part, from the machine dictionary:
proposals for lemmatizing the type shown in the upper part.
Interactive lemmatizing therefore consists in recording the
(automatically generated) number of the appropriate proposal
in the line of the token. If the machine dictionary yields no
proposal, or no appropriate one, the user immediately inserts
the appropriate dictionary entry and records its proposal
number in the upper part of the screen. Such a new proposal is
stored in an additional dictionary that is transferred
periodically into the main dictionary.
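
The interactive step described above can be illustrated with a
minimal sketch. It is written in Python rather than PL/1, and
every name in it (Session, proposals, add_proposal,
record_choice, merge_additional) is hypothetical; it only
mirrors the workflow of numbered proposals, an additional
dictionary for new entries, and a periodic transfer into the
main dictionary, not the actual LDVLIB(LEM) implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Session:
    """Hypothetical model of one lemmatizing session."""
    main_dict: Dict[str, List[str]]  # type -> lemma proposals
    additional_dict: Dict[str, List[str]] = field(default_factory=dict)

    def proposals(self, word_type: str) -> List[Tuple[int, str]]:
        """Numbered proposals for a type, main dictionary first."""
        merged = (self.main_dict.get(word_type, [])
                  + self.additional_dict.get(word_type, []))
        return list(enumerate(merged, start=1))

    def add_proposal(self, word_type: str, lemma: str) -> int:
        """Store a user-supplied lemma in the additional dictionary
        and return its automatically generated proposal number."""
        self.additional_dict.setdefault(word_type, []).append(lemma)
        return len(self.proposals(word_type))

    def record_choice(self, word_type: str, number: int) -> str:
        """Return the lemma behind the number recorded on a token line."""
        for n, lemma in self.proposals(word_type):
            if n == number:
                return lemma
        raise ValueError(f"no proposal {number} for type {word_type!r}")

    def merge_additional(self) -> None:
        """Periodic transfer of the additional into the main dictionary."""
        for word_type, lemmata in self.additional_dict.items():
            target = self.main_dict.setdefault(word_type, [])
            for lemma in lemmata:
                if lemma not in target:
                    target.append(lemma)
        self.additional_dict.clear()


# Illustrative data only: the type "amo" with one stored proposal.
session = Session(main_dict={"amo": ["amare"]})
new_number = session.add_proposal("amo", "amo")   # user adds a noun lemma
chosen = session.record_choice("amo", new_number)
session.merge_additional()
```

Keeping new proposals in a separate structure until the merge
reflects the text's distinction between the additional and the
main dictionary; how LDVLIB(LEM) handles concurrent users is
not modelled here.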
The continually growing machine dictionary is based on a
national language frequency vocabulary of about 25,000 types
including about 5,000 lemmata. Great care has been put into
the design of the information codes. The machine dictionary
entries consist of 4 fields: type (inflected word form), lemma
(deflected keyword), lemma information and type information.
The lemma information includes the following segments:
- word class and additional information
- additional lemmata (enclitic article, pronouns)
- disambiguation of homography
- cross-reference to the standard lemma (to be generated
in the printed output):
- graphic variant of the lemma (archaic writing)
- alteration of the lemma (diminutive by suffixation)
- short paraphrase in case of homonymy, where
disambiguation is default (in case of polysemy, where
disambiguation is optional)
The type information includes the following segments: 
- morphological information (gender, number, person, 
mood, tense, case, gradation) 
- morphological variants (archaic inflexion) 
- graphic variants (elision, short form) 
- special, i.e. idiomatic, use
- relation to a distinct vocabulary (e.g. frequency
vocabulary)
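
The four-field entry described above might be represented as
in the following sketch. The class name, the field names and
the key names inside the two information fields are
assumptions made for illustration; the source does not give
the actual coding scheme of the information segments.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DictionaryEntry:
    """Hypothetical four-field machine dictionary entry."""
    type_form: str                      # inflected word form
    lemma: str                          # keyword the type belongs to
    lemma_info: Dict[str, str] = field(default_factory=dict)
    type_info: Dict[str, str] = field(default_factory=dict)


# Illustrative entry; the segment names and values are assumed.
entry = DictionaryEntry(
    type_form="uomini",
    lemma="uomo",
    lemma_info={"word_class": "substantive"},
    type_info={"gender": "masculine", "number": "plural"},
)
```

Separating lemma information from type information means that
facts about the keyword are not repeated in the coding of each
inflected form.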
The users of concordances (lemmatized or not) have
different interests. In literary research one may study the
single types or even merely the single tokens of a lemma in
the order of occurrence in a work. In linguistic research one
may be interested in the alphabetic order of the types and in
the subsequent alphabetic order of the right context of the
single tokens. These two examples of ordered concordances
don't need the type information. But the type information as
provided in our machine dictionary makes possible a more
sophisticated internal order of the lemmata: e.g. singular
precedes plural, positive precedes comparative and
superlative, present precedes past, morphological and graphic
variants are distinguished or not, idiomatic uses are ordered
separately or not.
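
The three orderings discussed above can be expressed as sort
keys, as in this sketch. The token layout (a dict with
"lemma", "type", "position", "right_context" and "type_info")
and the rank tables are assumptions for illustration; only the
precedence rules themselves (singular before plural, positive
before comparative and superlative, present before past) come
from the text.

```python
NUMBER_RANK = {"singular": 0, "plural": 1}
GRADATION_RANK = {"positive": 0, "comparative": 1, "superlative": 2}
TENSE_RANK = {"present": 0, "past": 1}
UNSPECIFIED = 99  # features a type does not carry sort last


def occurrence_key(token):
    """Literary use: tokens of a lemma in order of occurrence."""
    return (token["lemma"], token["position"])


def alphabetic_key(token):
    """Linguistic use: by type, then by right context."""
    return (token["type"], token["right_context"])


def morphological_key(token):
    """Internal order of a lemma driven by the type information."""
    info = token["type_info"]
    return (
        token["lemma"],
        NUMBER_RANK.get(info.get("number"), UNSPECIFIED),
        GRADATION_RANK.get(info.get("gradation"), UNSPECIFIED),
        TENSE_RANK.get(info.get("tense"), UNSPECIFIED),
        token["type"],
    )


# Illustrative tokens only; contexts and positions are invented.
tokens = [
    {"lemma": "uomo", "type": "uomini", "position": 4,
     "right_context": "sono", "type_info": {"number": "plural"}},
    {"lemma": "uomo", "type": "uomo", "position": 1,
     "right_context": "che", "type_info": {"number": "singular"}},
    {"lemma": "bello", "type": "bellissimo", "position": 2,
     "right_context": "e", "type_info": {"number": "singular",
                                          "gradation": "superlative"}},
    {"lemma": "bello", "type": "bello", "position": 3,
     "right_context": "il", "type_info": {"number": "singular",
                                           "gradation": "positive"}},
]

by_morphology = [t["type"] for t in sorted(tokens, key=morphological_key)]
by_alphabet = [t["type"] for t in sorted(tokens, key=alphabetic_key)]
by_occurrence = [t["type"] for t in sorted(tokens, key=occurrence_key)]
```

Only the morphological key consults the type information; the
other two keys work on the bare concordance, as the text notes.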
Access to a lemmatized concordance will be like access to
a data base, and the linguist interested in certain phenomena
may select by options, e.g., the substantives and adjectives
only, or all verbs in passive construction. LDVLIB(LEM) always
allows the user to get a full print of the lemmatized
concordance or a reduced print of a list of 1 to n lemmata.
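
Selection by options, as described above, could look like the
following sketch. The function name, its parameters and the
record layout are hypothetical; selecting verbs in passive
construction would need syntactic information that this
simplified record does not carry.

```python
def select(records, word_classes=None, lemmata=None):
    """Return the concordance records matching the given options.

    An option left as None places no restriction, so calling
    select(records) corresponds to the full print.
    """
    return [
        r for r in records
        if (word_classes is None or r["word_class"] in word_classes)
        and (lemmata is None or r["lemma"] in lemmata)
    ]


# Illustrative records only.
records = [
    {"lemma": "uomo", "word_class": "substantive"},
    {"lemma": "bello", "word_class": "adjective"},
    {"lemma": "amare", "word_class": "verb"},
]

nominal = select(records, word_classes={"substantive", "adjective"})
reduced = select(records, lemmata={"amare"})
full = select(records)
```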
It will be shown that supporting the philologist's work
with a large dictionary is not only useful in concordance
making, but also accumulates a lot of material for subsequent
lexicographic work. Looking ahead, two questions must be
considered: the integration of a dictionary data base and the
productive use of grammatical procedures like ATNs to shift
the balance between intellectual work and machine support in
the direction of "a little bit more automatic".
