Proceedings of the Workshop on Multilingual Language Resources and Interoperability, pages 25–31,
Sydney, July 2006. c©2006 Association for Computational Linguistics
The LexALP Information System: Term Bank and Corpus for Mult i-
lingual Legal Terminology Consolidated 
 
 
Verena Lyding, Elena Chiocheti 
EURAC research 
Viale Druso 1, 39100 Bozen/Bolzano - Italy  
forename.name@eurac.edu 
Giles Sérasset, Francis Brunet-Manquat 
GETA-CLIPS IMAG 
BP 53, 38041 Grenoble cedex 9 - France 
forename.name@imag.fr 
 
  
 
Abstract 
Standard techniques used in multilingual 
terminology management fail to describe 
legal terminologies as they are bound to 
different legal systems and terms do not 
share a common meaning. In the LexALP 
project, we use a technique defined for 
general lexical databases to achieve cros 
language interoperability between lan-
guages of the Alpine Convention. In this 
paper we present the methodology and 
tols developed for the colection, de-
scription and harmonisation of the legal 
terminology of spatial planning and sus-
tainable development in the four lan-
guages of the countries of the Alpine 
Space. 
1 Introduction 
The aim of the LexALP project is to harmo-
nise the terminology used by the Alpine Conven-
tion, both for internal purposes and for comu-
nication among the member states. The Alpine 
Convention is an international treaty signed by 
all states of the Alpine territory (France, Monaco, 
Switzerland Liechtenstein, Austria, Germany, 
Italy and Slovenia) for the protection of land-
scape and sustainable development of this moun-
tain area
1
. The member states speak four differ-
ent languages, namely French, German, Italian, 
and Slovene and have different legal systems and 
traditions. 
Hence arises the need for a systematization 
and uniformation of terminology and clear trans-
lation equivalence in all four languages. For this 
reason, the project intends to provide all 
                                                
1
 cf. also htp: ww.alpenkonvention.org 
stakeholders and the wider public with an infor-
mation system which combines three main com-
ponents, a terminology data base, a multilingual 
corpus and the relative bibliographic data base. 
In this way the manually revised, elaborated and 
validated (harmonised) quadrilingual information 
on the legal terminology (i.e. complete termino-
logical entries) wil be closely interacting with a 
facility to dynamically search for additional con-
texts in a relevant set of legal texts in all lan-
guages and for all main legal systems involved. 
2 Multilingual legal information system 
The information system for the terminology 
of the Alpine Convention, with a specific focus 
on spatial planning and sustainable development, 
wil give the posibility to search for relevant 
terms and their (harmonised or rejected) 
translations in all 4 official languages of the 
Alpine Convention in the first module, the term 
bank. Next to retrieving synonyms and 
translation equivalents within each legal system, 
the user wil be provided with a representative 
context and a valid definition of the concept 
under consideration. Source information wil be 
provided for each text field in the terminological 
entry. 
Via a link from the terminological data base 
to the second module, the corpus facility, the 
information system wil give the posibility to 
search the corpus for further contexts. 
Finally, both term bank and corpus wil be i n-
teracting with a third module, the bibliographic 
database, so as to allow retrieving ful informa-
tion on text excepts cited in the term bank and to 
store important meta data on corpus documents. 
25
3 Terminological data 
3.1 Data categories and motivations 
The data categories present in the terminology 
database allow entering and organising relevant 
information on the concept under analysis. The 
term bank interface allows entering of the fol-
lowing terminological data categories: denomi-
nation/term, definition, context, note, sources 
(text fields), grammatical information to the term, 
harmonisation status, processing status, geo-
graphical usage, frequency and domain, accord-
ing to the appositely elaborated domain classifi-
cation structure
2
 (pul down menus). Again by 
means of pul-down menus the terminologist wil 
be able to signal to the users which terms are al-
ready processed (i.e. checked by legal experts), 
harmonised or rejected and - most important - to 
which legal system they belong (the menu geo-
graphical usage allows to specify this informa-
tion). Furthermore it is posible to specify syno-
nyms, short forms, abbreviations etc. in the ter-
minological entry and, if necesary, link them to 
the relative ful information already present in 
the term bank (however, no direct access to these 
linked data is posible, this must be done via the 
search interface). Finally, the terminol ogist is 
given the posibility of writing general com-
ments to the entry. At the very end of one lan-
guage entry the terminologist can decide whether 
to release the data to the public (by clicking on 
the buton ‘finish’) or keep it for further fine-
tuning (buton ‘update’). 
Each term is created in its ‘language volume’ 
and described by means of all necessary informa-
tion. As son as one or all equivalents in the 
other languages are available to, the single en-
tries can be linked to each other with the help of 
an axie (see detailed description below). 
Searches can be done for all languages or on a 
user-defined selection of source and target lan-
guages. Presently the database allows global 
searching in all text fields and filtering by source, 
author, date of creation, as well as by axie name 
and ids. Results can be displayed in ful form, as 
a short list of terms only or in XML. Some ex-
port/import functions are granted. 
As the term bank serves mainly the scope of 
diffusing harmonised terminology, the four trans-
lation equivalents (validated by a group of ex-
perts) are displayed together, whereas rejected 
synonyms are displayed separately for each 
search language. In this way the user may well 
                                                
2
 Se also 4.1  
lok for a non validated synonym and find it in 
the database but be warned as to which is the 
preferred term and its harmonised equivalents in 
the other languages. Figure 1 shows such a situa-
tion where the French rejected term “transport 
intra-alpin” is linked to the harmonised term 
“trafic intra-alpin”. 
 
 
Figure 1: A set of Alpine Convention terms and 
their relations 
3.2 Monolingual data 
The LexALP term bank consists in 5 volumes 
for French, German, Italian, Slovene and English 
(no data is being entered for this fifth language at 
the moment), which contain the term descrip-
tions. The set of data categories is represented in 
an XML structure that folows a common 
schema.  
<entry id="fra.trafic_intra-alpin.1010743.e" 
    lang="fra" 
    legalSystem="AC" 
    proces_status="FINALISED" 
    status="HARMONISED"> 
 <term>trafic intra-alpin</term> 
 <gramar>n.m.</gramar> 
 <domain>Transport</domain> 
 <usage frequency="comon" 
     geographical-code="INT" 
     technical="false"/> 
 <relatedTerm  
  isHarmonised="false" 
  relationToTerm="Synonym" 
  termref="fra.transport_intra-alpin…"/> 
 <relatedTerm  
  isHarmonised="false" 
  relationToTerm="Synonym" 
  termref="fra.circulation_intra-…"/> 
 <definition> 
  [T]rafic constitué de trajets ayant leur  
  point de départ et/ou d'arivée à  
  l'intérieur de l'espace alpin. 
 </definition> 
 <source>Prot. Transp., art. 2 </source> 
 <context url="htp:/ww.."> 
  Des projets routiers à grand débit pour  
  le trafic intra-alpin peuvent être  
  réalisés, si [..]. 
 </context> 
</entry> 
Figure 2: XML form of the term ‘trafic intra-
alpin’ 
Each entry represents a unique term/meaning. 
Terms with the same denomination, but belong-
26
ing to different legal systems have, de facto, di f-
ferent meanings. Hence, different entries are cre-
ated. Terms with different denominations but 
conveying the same ‘meaning’ (concept) are also 
represented using different entries
3
. In this case, 
the entries are linked through a synonymy rela-
tion. 
Figure 2 shows the XML structure of the 
French term “trafic intra-alpin”, as defined in the 
Alpine Convention. The term entry is associated 
to a unique identifier used to establish relations 
between volume entries. 
The example term belongs to the Alpine Con-
vention legal system
4
 (code AC). The entry also 
bears the information on its status (harmonised 
or rejected) and its processing status (to be proc-
essed, provisionally processed or finalised). 
In addition, a definition (along with its source) 
and a context may be given. The definition and 
context should be extracted from a legal text , 
which must be identified in the source field. 
3.3 Achieving language/legal system 
interoperability 
As the project deals with several different le-
gal terms, standard techniques used in multilin-
gual terminology management need to be 
adapted to the peculiarities of the specialised 
language of the law. Indeed, terms in different 
languages are (generally) defined according to 
different legal systems and these legal systems 
canot be changed. Hence, it is not posible to 
define a common ’meaning’ that could be used 
as a pivot for language interoperability
5
. In this 
respect, legal terminology is closer to general 
lexicography than to standard terminology.  
In order to achieve language/legal system 
interoperability we had several options that are 
used in general lexicography.  
Using a set of bilingual dictionaries is not an 
option here, as we have to deal with at least 16 
                                                
3
 Variants, acronyms, etc. are not considered as di f-
ferent denominations. 
4
 Strictly speaking, the Alpine Convention does not 
constitute a legal system per se. 
5
 Consider for instance the diference betwen the 
Italian and the Austrian concepts of journalists’ pr o-
fesional confidentiality. Whereas the Redaktionsge-
heimnis explicitly underlines that the journalist can 
refuse to witnes in court in order to kep the profes-
sional secret, in Italy the segreto giornalistico must 
obligatorily be lifted on a judge’s request. The two 
concepts have overlaping meanings in the two states, 
however, they diverge greatly with respect to the be-
haviour in court. 
language/legal system couples (with alpine Con-
vention and EU levels, but without taking into 
account regional levels). Moreover, such a sol u-
tion wil not reflect the multilingual aspect of the 
Alpine Convention or the Swis legal system. 
Finally, building bilingual volumes between the 
French and Italian legal systems is far beyond the 
objectives of the LexALP project. 
Another solution would be to use an “Eu-
rowordnet like” approach (Vosen, 198) where 
a specific language/legal system is used as a 
pivot and elements of the other systems are 
linked by equivalent or near-equivalent links. As 
such an aproach artificially puts a language in 
the pivot position, it generally leads to an “eth-
nocentric” view of the other languages. The ad-
vantage being that the architecture uses the bili n-
gual competence of lexicographers to achieve 
multilingualism.  
In this project, we chose to use ‘interlingual 
acceptions’ (a.k.a. axies) as defined in (Sérasset, 
194) to represent such complex contrastive 
phenomena as generally described in general 
lexicography work. In this aproach, each ‘term 
meaning’ is associated to an interlingual accep-
tion (or axie). These axies are used to achieve 
interoperability as a pivot linking terms of differ-
ent languages bearing the same meaning. 
However, as we are dealing with legal terms 
(bound to different legal systems), it is generally 
not posible to find terms in different languages 
that bear the same meaning. In fact such t erms 
can only be found in the Alpine Convention 
(which is considered as a legal system expressed 
in all the considered languages). Hence, we use 
these terms to achieve interoperability between 
languages. In this aspect, we are close to Eu-
rowordnet’s approach as we use a specific legal 
system as a pivot, but in our case the pivot itself 
is generally a quadrilingual set of entries.  
These harmonised Alpine Convention terms 
are linked through an interlingual acception. An 
axie is a place holder for relations. Each interlin-
gual acception may be linked to several term en-
tries in the languages volumes through termref 
elements and to other interlingual acceptions 
through axieref elements, as ilustrated in 
Figure 3. 
<axie id="axi.101424.e"> 
 <termref  
 idref="ita.trafico_intralpino.1010654.e"  
 lang="ita"/> 
 <termref  
 idref="fra.trafic_intra-alpin.1010743.e"  
 lang="fra"/> 
 <termref 
 idref="deu.ineralpiner_Verkehr.101065.e"  
27
 lang="deu"/> 
 <termref  
 idref="slo.znotrajalpski_promet.101132.e"  
 lang="slo"/> 
 <axieref idref="/> 
 <misc></misc> 
</axie> 
Figure 3: XML form of the interlingual aception 
ilustrated Figure 1 
The termref relation establishes a direct 
translation relation between these harmonised 
equivalents. Then, national legal terms are indi-
rectly linked to Alpine Convention terms through 
the axieref relation as ilustrated in Figure 4. 
 
Figure 4: An example French term, linked to a 
quadrilingual Alpine Convention Term. 
4 Corpus 
4.1 Corpus content 
The corpus comprises around 300 legal 
documents of eight legal systems (Germany, It-
aly, France, Switzerland, Austria, Slovenia, 
European law  and international law with the 
specific framework of the Alpine Convention,) 
(see table 1). 
 
AT CH DE FR IT SI AC EU INT 
612 19 62 613 490 213 38 791 149 
Table 1: Corpus documents for each legal system 
Documents of the supranational level are pro-
vided in up to four languages (subject to avail-
ability). National legislation is generally added in 
the national language (monolingual documents) 
and in case of Switzerland (multilingual docu-
ments) in the three official languages of that na-
tion (French, German and Italian). 
The documents are selected by legal experts 
of the respective legal systems folowing prede-
fined criteria: 
• entire documents (no single paragraphs 
or excerpts etc.); 
• strong relevance to the subjects ‘spatial 
planning and sustainable development’ as 
described in art. 9 of the relative Alpine 
Convention Protocol; 
• primary sources of the law for every sys-
tem at national and international/EU level, 
i.e. normative texts only (laws, codes etc.);  
• latest amendments and versions of all 
legislation (at time of colection: June – 
August 205); 
• terminological relevance. 
Each document is classified according to the 
folowing (bibliographical) categories: ful title, 
short title, abbreviation, legal system, language, 
legal hierarchy, legal text type, subfield (1, 2 and 
3), oficial date, official number, published in 
official journal (date, number, page), … The bib-
liographical information of all documents is 
stored in a database and can at any time be con-
sulted by the user. 
The subfields have been elaborated and se-
lected by a team of legal experts, taking into ac-
count the classification specificities folowed by 
the Alpine Convention and the need to classify 
texts from several different legal systems accord-
ing to one common structure. For this reason, the 
legal experts have subdivided the fields spati al 
planning and sustainable development into 5 
main areas, in accordance with the Alpine Con-
vention Protocol dealing with these subjects and 
subsequently adopted an EU-based model for 
further subdividing the 5 main topics in such a 
way that all countries involved could classify 
their selected documents under a maximum of 3 
main items, the first of which must be indicated 
obligatorily. This classification allows an easy 
selection of all subsets of documents according 
to subject field. 
 
Figure 5: Example of document clasification 
28
<header  
 lang="ita" 
 creator="X" 
 created="Fri Feb 17 10:45:15 CET 206"> 
<h.title> 
 Lege_regionale_25974.14_87.txt 
</h.title>  
<bibID> 
 17658 
</bibID> 
</header> 
Figure 6: XML-header of corpus documents 
<text id="17658"> 
<body id="17658.b"> 
<div type="intro" id="17658.b.i"> 
<p id="17658.b.i.p1"> 
<title id="17658.b.i.p1.ti1"> 
LEGE REGIONALE 15/05/1987, N. 014  
Disciplina del' esercizio […] di 
fauna selvatica. 
</title> 
</p> 
</div> 
<div type="section" id="17658.b.c0.se1"> 
<p id="17658.b.c0.se1.p1"> 
<title id="17658.b.c0.se1.p1.ti1"> 
Art. 1 
</title> 
</p> 
<p id="17658.b.c0.se1.p2"> 
<s id="17658.b.c0.se1.p2.s1"> 
1. Sul' intero teritorio  
regionale la cacia seletiva  
per qualita', […] 
</s> 
<s id="17658.b.c0.se1.p2.s2"> 
a) capriolo: dal 15 magio al 
 15 genaio;  
</s> 
<s id="17658.b.c0.se1.p2.s3"> 
b) cinghiale: dal 15 giugno  
al 15 genaio;  
</s> 
</p> 
<p id="17658.b.c0.se1.p3"> 
<s id="17658.b.c0.se1.p3.s1"> 
2. E' ameso l' uso […]  
</s> 
</p> 
</div> 
</body> 
</text> 
Figure 7: XML-structure of corpus document 
4.2 Structural organization of corpus data 
Colected in raw text format (one file for each 
legal text) the documents are first transformed 
into XML-structured files and in a second step 
inserted into the database.  
The XML-annotation is done in compliance 
with the Corpus Encoding Standard for XML 
(XCES)
6
. Slightly simplified, the provided 
schema
7
 serves to add structural information to 
the documents. Each text is segmented into sub-
sections like: preamble, chapter, section, para-
                                                
6
 htp:/ww.cs.vasar.edu/XCES/ 
7
 htp:/ww.cs.vasar.edu/XCES/schema/xcesDoc.xsd 
graph, title and sentence. Furthermore, a link to 
the classification data (bibliographic data base) is 
inserted and, in case of multilingual documents, 
alignment is done at sentence level. 
The XML-annotated documents hold all the 
information needed for the insertion into the cor-
pus database, such as structural mark-up and bib-
liographical information. The ful text documents 
are transformed into sets of database entries, 
which can be imported into the database.  
4.3 Technical organization of corpus data 
Folowing the bistro approach as realized for 
the Corpus Ladin dl’Eurac (CLE) (Streiter et al. 
204) the corpus data is stored in a relational 
database (PostgreSQL). The information present 
in the XML-annotated documents is distributed 
among four main tables: document_info,  cor-
pus_words, corpus_structure, corpus_alignment.  
 
The four tables can be described as folows:  
document_info: This table holds the meta-
information about the documents; each category 
(like ful title, short title, abbreviation, legal sys-
tem, language, etc.) is represented by a separate 
column. For each legal document one entry (one 
row) with unique identification number is added 
to the table. These identification numbers are 
cited in the XML-header of the corpus docu-
ments. 
corpus_words: This table holds the actual 
text of the colected documents. Instead of stor-
ing entire paragraphs as it was done during the 
creation of CLE, for this corpus a different ap-
proach is being tested. Every annotated text is 
split into an indexed sequence of words, starting 
with counter one. Once inserted into the database 
a text is stored as a set of tuples composed of 
word, position in text and document id (as a ref-
erence to the document information).  
corpus_structure: This table holds all infor-
mation about the internal structure of the docu-
ments. Titles, sentences, paragraphs etc. are 
stored by indicating starting and ending point of 
the section. For each segment a tuple of segment 
type, segment id, starting point (indicated by the 
index of the first word), ending point (indicated 
by the index of the last word) and document id is 
added. 
corpus_alignment: This table defines the 
alignment of multilingual documents. By provi d-
ing one column for each language the texts are 
aligned via the document ids or via the ids of 
single segments. 
 
29
The tables are interconected by explicitly 
stated references. That means that the columns of 
one table refer to the values of a certain column 
of another table. As shown in figure 8 all tables 
hold a column document_id that refers to the 
document id of the table document_info. Fur-
thermore, the table corpus_structure holds refer-
ences to the column position of the table cor-
pus_words. 
 
 
Figure 8: Interconection of tables 
5 Searching the corpus 
Due to the fine-grained classification (see section 
4.1) and the structural mark-up (see section 4.2) 
of all corpus documents, corpus searches can be 
restricted in the folowing ways: 
• by specifying a subset of corpus docu-
ments over which the search should be 
carried out (e.g. all documents of legal 
system CH with language French); 
• by chosing the type of unit to be dis-
played (whole paragraphs <p>, sentences 
<s>, titles <title>, …); 
• by searching for whole words only (ex-
act match) or parts of words (fuzzy 
match); 
• by restricting the number of hits to be 
displayed at a time. 
For searches in multilingual documents it wil be 
posible to search for aligned segments, specify-
ing search word as well as target translation. For 
example, the user could search for all alignments 
of German-Italian sentences that contain the 
word Umweltschutz translated as tutela ambien-
tale (and not with protezione dell’ambiente). 
Figure 9 shows a simple interface for searching 
monolingual documents. 
 
Figure 9: Example search over monolingual 
documents 
6 Interaction term bank and corpus 
Term bank and corpus are independent compo-
nents which together form the LexALP Informa-
tion System. 
The interaction between corpus and term bank 
wil concern in particular 1) corpus segments 
used as contexts and definitions in the termino-
logical entries, 2) short source references in the 
term bank (and the associated sets of bibli o-
graphical information) and 3) legal terms. 
6.1 Entering data into term bank 
When adding citations to a term bank entry, the 
relative bibliographic information wil automati-
cally be counterchecked with the contents of the 
bibliographical database. In case the information 
about the cited document is already present in the 
DB, a link to the term bank can be added. Ot h-
erwise the terminologist is asked to provide all 
information about the new source to the biblio-
graphic database and later create the link. 
Next to static contexts and definitions present 
for each terminological entry, each entry wil 
show a buton for the dynamic creation of con-
texts. Hiting the buton wil start a context 
search in the corpus and return all sentences con-
taining the term under consideration. 
6.2 Searching the corpus 
When searching the corpus the user wil have the 
oportunity to highlight terms present in the term 
bank. In the same way standardised or rejected 
terms can be brought out. Via a link it wil then 
30
be posible to directly access the term bank entry 
for the term found in the corpus. 
In general each corpus segment is linked to the 
ful set of bibliographic information of the 
document that the segment is part of. Accessing 
the source information wil lead the user to a de-
tailed overview as shown in figure 4. 
7 Conclusion 
In this paper, we have presented the LexALP 
information system, used to colect, describe and 
harmonise the terminology used by the Alpine 
Convention and to link it with national legal ter-
minology of the alpine Convention’s member 
states. Even if we currently give a specific focus 
on spatial planning and sustainable development, 
the project is not restricted to these fields and the 
methodology and tols developed can be adapted 
to legal terminology of other fields.  
In this paper we also proposed a solution to 
the encoding of multilingual legal terminologies 
in a context where standard techniques used in 
multilingual terminology management usually 
fail. 
The terminology developed and the corpus 
used for its development wil be accessible on-
line for the stakeholders and the wider public 
through the LexALP information system. 
Acknowledgements 
The LexALP research project started in Janu-
ary 205 thanks to the funds granted by the IN-
TERREG IIIB “Alpine Space” Programme, a 
Community Initiative Programme funded by the 
European Regional Development Fund.  

References 
Giles Séraset. 194. Interlingual Lexical Organisa-
tion for Multilingual Lexical Databases in NADIA. 
In Makoto Nagao, editor, COLING-94, volume 1, 
pages 278—282, August. 
Streiter, O., Stufleser, M. & Ties, I. (204). CLE, an 
aligned Tri-lingual Ladin-Italian-German Corpus. 
Corpus Design and Interface, LREC 204, Work-
shop on "First Steps for Language Document ation 
of Minority Languages: Computational Linguistic 
Tols for Morphology, Lexicon and Corpus Com-
pilation" Lisbon, May 24, 204. 
Vosen, Piek. 198. Introduction to EuroWordNet. In 
Nancy Ide, Daniel Grenstein, and Piek Vosen, 
editors, Special Isue on EuroWordNet, Computers 
and the Humanities, 32(2-3): 73-89. 
Wright, Sue Elen 201. Data Categories for Termi-
nology Management. In Sue Elen Wright & 
Gerhard Budin, editors, Handbok of Terminology 
Management, volume 2, pages 52-569. 
