MULTEXT: Multilingual Text Tools and Corpora
Nancy Ide and Jean Véronis
LABORATOIRE PAROLE ET LANGAGE 
CNRS & Université de Provence
29, Avenue Robert Schuman 
13621 Aix-en-Provence Cedex 1 (France) 
e-mail: ide@fraixll.univ-aix.fr, veronis@fraixll.univ-aix.fr
Abstract. MULTEXT (Multilingual Text Tools and
Corpora) is the largest project funded in the Commission 
of European Communities Linguistic Research and 
Engineering Program. The project will contribute to the 
development of generally usable software tools to 
manipulate and analyse text corpora and to create multi- 
lingual text corpora with structural and linguistic 
markup. It will attempt to establish conventions for the 
encoding of such corpora, building on and contributing 
to the preliminary recommendations of the relevant 
international and European standardization initiatives. 
MULTEXT will also work towards establishing a set of 
guidelines for text software development, which will be 
widely published in order to enable future development 
by others. All tools and data developed within the project 
will be made freely and publicly available. 
Keywords. multi-lingual corpora, text markup, text 
software, corpus annotation. 
1. Introduction 
Text-oriented methods and software tools have come to 
be of primary interest to the NLP community. However, 
existing tools for natural language processing (NLP) and 
machine translation (MT) corpus-based research are 
typically embedded in large, non-adaptable systems 
which are fundamentally incompatible. Little effort has 
been made to develop software standards, and software 
reusability is virtually non-existent. As a result, there is a 
serious lack of generally usable tools to manipulate and 
analyze text corpora that are widely available for
research, especially for multi-lingual applications.
At the same time, the availability of data is hampered by 
a lack of well-established standards for encoding 
corpora. Although the Text Encoding Initiative (TEI) has
provided guidelines for text encoding [Sper94], they are
so far largely untested on real-scale data, especially
multi-lingual data. Further, the TEI Guidelines offer a
broad range of text encoding solutions serving a variety
of disciplines and applications, and are not intended to 
provide specific guidance for the purposes of NLP and 
MT corpus-based research. 
MULTEXT (Multilingual Text Tools and Corpora) is a
recently initiated large-scale project funded under the
Commission of European Communities Linguistic 
Research and Engineering Program, which is intended to 
address these problems. The project will contribute to the 
development of generally usable software tools to 
manipulate and analyse text corpora and to create multi-
lingual text corpora with structural and linguistic
markup. It will attempt to establish conventions for the
encoding of such corpora, building on and contributing 
to the preliminary recommendations of the relevant 
international and European standardization initiatives. 
MULTEXT will also work towards establishing a set of 
guidelines for text software development, which will be 
widely published in order to enable future development
by others. The project consortium, consisting of eight
academic and research institutions and six major
European industrial partners, is committed to making its
results, namely corpus, related tools, specifications and
accompanying documentation, freely and publicly 
available. 
2. Project Overview 
At the outset of the project, the consortium will
undertake to analyse, test and extend the SGML-based
recommendations of the TEI on real-size data, and
gradually develop encoding conventions specifically
suited to multi-lingual corpora and the needs of NLP and
MT corpus-based research. To manipulate large
quantities of such texts, the partners will, in collaboration
with the recently established Text Software Initiative
(TSI), develop conventions for tool construction and use
them to build a range of highly language-independent,
atomic and extensible software tools.
These specifications will be the basis for the
development of two major software resources, namely 
(a) tools for the linguistic annotation of texts (e.g. 
segmenters, morphological analysers, part of speech 
disambiguators, aligners, prosody taggers and post- 
editing tools), and (b) tools for the exploitation of 
annotated texts (e.g. tools for indexing, search and 
retrieval, statistics). This software will be implemented 
under UNIX, while its specific properties should 
facilitate portability to other systems. Moreover, it will 
be integrated by means of a common user interface into a 
text corpus manipulation system expected to provide the 
basic functionality needed in academic or industrial 
corpus research. For the overall software design as well 
as the development of specific components, MULTEXT 
will capitalise on the experience and, possibly, 
preliminary results achieved in the ALEP project. 
By using the emerging software tools, the consortium 
plans to produce a substantial multilingual corpus, 
including parallel texts and spoken data, in six EC 
languages (English, French, Spanish, German, Italian and 
Dutch). The entire corpus will be marked for gross 
logical and structural features; a subset of the corpus will 
be marked and hand-validated for sentence and sub- 
sentence features, part of speech, alignment of parallel 
texts, and speech prosody. All markup will have to
comply with the TEI-based corpus encoding conventions
established within the project. The corpus will also serve
as a testbed for the project tools and a resource for future
tool development and evaluation.
An application programming interface will facilitate the
coupling of the progressively refined software and data
components with several existing language application
systems or prototypes. In particular, the industrial
partners plan to develop extraction software for lexical
and terminological information to complement and
improve their Terminology Management, Information
Retrieval or Machine Translation systems. Some effort
will also be devoted to a prototypical application for
testing and comparing successive versions of a Machine 
Translation system. 
3. Background and approach 
3.1. Software Standard 
MULTEXT is strongly committed to "software 
reusability", to avoid the re-inventing of the wheel and
development of largely incompatible and non-extensible
software that is characteristic of much language-analytic
research in the past three decades. Therefore, the project
will establish a software standard for the development of
its tools. This will enable these tools to be universally
used and extended by others.
We outline here the principles (borrowed from
[IdeV93a]) underlying the MULTEXT approach to
software design, which enable flexibility, extendability,
and reusability.
• Principle 1: Language independence
The first goal is to extend existing methods to other
European languages. So far, these methods have been
applied almost exclusively to English. Therefore, the
methods will be adapted to produce language-
independent tools, by using an engine-based approach
where all language-dependent materials are provided as
data. Thus, extension of the tools to cover additional 
languages will in most cases involve only providing the 
appropriate tables and rules. 
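As a minimal sketch of the engine-based approach, the following sentence splitter keeps all language-dependent material in a data table and leaves the engine untouched when a language is added. The table contents, the language codes and the function name `split_sentences` are assumptions made for illustration; they are not MULTEXT specifications.

```python
# Illustrative only: language-dependent material is plain data,
# and one engine serves every language. The tables are hypothetical.
ABBREVIATIONS = {
    "en": {"Dr.", "Mr.", "etc."},
    "fr": {"M.", "Mme.", "etc."},
}

def split_sentences(text, lang):
    """Close a sentence at a token ending in '.', '!' or '?',
    unless that token is a known abbreviation for the language."""
    abbreviations = ABBREVIATIONS[lang]
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing material without final punctuation
        sentences.append(" ".join(current))
    return sentences

english = split_sentences("Dr. Smith arrived. He sat down.", "en")
french = split_sentences("M. Dupont est venu. Il est parti.", "fr")
```

Under these assumptions, covering a further language amounts to adding one entry to `ABBREVIATIONS`; the engine code does not change.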
• Principle 2: Atomicity 
Existing text analytic software often comprises large, 
integrated systems that are nearly impossible to adapt or 
extend. MULTEXT will produce a set of small tools 
(often on the order of a few lines of code, with the 
absolute minimum of functionality) that researchers can
use alone or combine to create larger, more complex
programs, thereby implementing a "software Lego"
approach. In this way, increasingly complex program
bundles can be developed without the overhead of large
system design, and with ease of modification since any
program can be de-bundled into its constituent programs,
each consisting of a small, easily understandable piece of
code. MULTEXT will bundle its tools in a
comprehensive corpus-handling system, as well as
demonstrate their use in several high-level applications,
thus showing different ways in which the "Legos" can be
recombined in specific applications. 
• Principle 3: Operator/stream approach
MULTEXT will adopt the operator/stream approach to
software design, which has had widespread
implementation and use and is generally accepted in
research and industry. In particular, it has been used
increasingly in computational linguistics applications
(see, for instance, [Libe92]). The operator/stream
approach has served as the basis for the UNIX operating 
system, which as a result provides a ready-made platform 
for its implementation. 
In the operator/stream approach, data flows in uni-
directional "streams" between functions. Each of these
functions is an "operator" that transforms the data as it
passes by. Since everything is understood in terms of
what goes in and what comes out, the emphasis is on
what needs to be done rather than how it is done. This
enables a focus on overall algorithms rather than
implementation details. Component functions are
independent, and at no point are compiled together in a
single program. This is a key point, since it means that
each operator can be implemented in a different
language, developed by different people, tested
independently, etc. In addition, new functions can be
plugged into the stream as needed, and all functions are
completely re-usable in other contexts. 
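The data flow can be sketched with generator functions, each acting as an operator on a stream of strings. This is only an analogy: the operators here run inside one process for brevity, whereas the approach described above keeps them as separate programs connected by streams. The operator names and the helper `pipeline` are assumptions made for illustration.

```python
def tokenize(lines):
    """Operator: stream of lines -> stream of whitespace-separated tokens."""
    for line in lines:
        yield from line.split()

def lowercase(tokens):
    """Operator: stream of tokens -> stream of lower-cased tokens."""
    for token in tokens:
        yield token.lower()

def drop_punctuation(tokens):
    """Operator: strip surrounding punctuation and discard empty results."""
    for token in tokens:
        stripped = token.strip(".,;:!?\"'()")
        if stripped:
            yield stripped

def pipeline(source, *operators):
    """Connect operators left to right; data flows in one direction only."""
    stream = source
    for operator in operators:
        stream = operator(stream)
    return stream

words = list(pipeline(["The cat, the dog."], tokenize, lowercase, drop_punctuation))
```

A new operator can be inserted at any position in the call to `pipeline` without touching the others, which is the property the operator/stream approach relies on.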
• Principle 4: Unique data type
Communication between programs will be by means of
flat, human-readable streams and files, apart from well-
defined, encapsulated binary formats for cases such as
speech signal, images, or indexes. The only data type is
therefore the string. There is some overhead in this
approach, since conversion from string to, say, number
and back is required for numbers that are to be
manipulated arithmetically, but the speed and storage
capacities of present-day machines virtually eliminate
this concern. More importantly, the use of string data
only enables an easy test-modify-test cycle, since the
input and output of any step can be examined and
manipulated using all-purpose tools freely available on
most machines, such as text editors, search software,
sorting utilities, etc. Finally, complex data types tie
programs to specific languages that implement those
types. The use of a unique data type eliminates this 
dependency. 
A feature of this strategy which is of major importance is
that any system can accept flat files. Therefore, data is
portable between different systems. In addition, it is
much easier to port software from system to system,
since the software accepts the same kind of input data.
For example, a program in C is likely to work on any
system with no or very minor modification. 
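The string-to-number conversion mentioned under Principle 4 can be illustrated with a small operator over flat text lines. The line layout (a word, a tab, a count) and the function name `merge_counts` are assumptions chosen for the example, not a format defined by the project.

```python
def merge_counts(lines):
    """Sum the counts of repeated words in 'word<TAB>count' lines.

    Input and output are both plain strings, so either side can be
    inspected with a text editor or fed to a sorting utility.
    """
    totals = {}
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        totals[word] = totals.get(word, 0) + int(count)  # string -> number
    # number -> string on the way out, sorted by word for stable output
    return [f"{word}\t{total}" for word, total in sorted(totals.items())]

merged = merge_counts(["the\t3\n", "cat\t1\n", "the\t2\n"])
```

The arithmetic happens only inside the operator; everything that crosses its boundary remains a flat, readable line.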
• Principle 5: Internal standard formats (ISFs)
To write the compatible set of tools we describe, it is 
essential that all programs communicate effectively. This 
demands that internal standard formats (ISFs) for data be 
developed, to serve as specifications for program 
development. It is essential that these formats are public, 
so that any program written anywhere by anyone can use 
them.
ISFs, like the functions that process them, are very
simple and straightforward. Many ISFs will be needed to
accommodate different possible "interpretations" of the
data, and their development will demand careful
consideration of text types, their structures and
properties. Therefore, ISF development should build
upon the TEI's work on text structures and categories and
ensure compatibility with it. Note that because ISFs 
represent only partially the information in an encoded 
text (that is, whatever is required for certain operations), 
they do not replace a TEI/SGML encoding of data, which 
represents all the information in an encoded text and can 
be used for interchange. Transduction programs to 
import TEI-conformant texts into one or more internal
standard formats, and vice versa, will be essential. 
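A transduction step of this kind might look as follows. Both formats are invented for the illustration: the input is a simplified SGML-like fragment with `<s>` and `<w type="...">` elements, and the flat format holds one word and its tag per line with an empty line closing each sentence. Neither is an actual TEI encoding or a MULTEXT ISF, and a regular expression stands in for a real SGML parser.

```python
import re

# Hypothetical, simplified markup: <s> wraps a sentence,
# <w type="POS"> wraps a word. Not a real TEI document.
SENTENCE = re.compile(r"<s>(.*?)</s>", re.DOTALL)
WORD = re.compile(r'<w type="([^"]*)">([^<]*)</w>')

def to_flat(marked_up):
    """Encoded text -> 'word<TAB>POS' lines, empty line after each sentence."""
    lines = []
    for sentence in SENTENCE.findall(marked_up):
        for pos, word in WORD.findall(sentence):
            lines.append(f"{word}\t{pos}")
        lines.append("")
    return lines

def from_flat(lines):
    """Flat lines -> simplified markup, restoring only what the flat format kept."""
    out, words = [], []
    for line in lines:
        if line:
            word, pos = line.split("\t")
            words.append(f'<w type="{pos}">{word}</w>')
        else:
            out.append("<s>" + "".join(words) + "</s>")
            words = []
    return "".join(out)

sample = '<s><w type="DET">the</w><w type="N">cat</w></s>'
flat = to_flat(sample)
```

The round trip is lossless here only because the sample contains nothing beyond words and tags; any other markup in the source would be dropped by `to_flat`, which is why a flat format of this kind cannot replace the full encoding.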
3.2. Tools 
All MULTEXT tools will be developed according to the 
principles outlined above. The project will use only well- 
known, state-of-the-art methods in tool development, in 
order to ensure the project's feasibility (e.g., [Chur88],
[Cutt92], [Gale91], [Hirst93], [Hirst91]). The project will
use these methods to produce a set of tools that is freely 
available, coherent, extensible, and language 
independent. The tools will be implemented under 
UNIX, but will be developed according to principles that 
will facilitate portability to other systems. 
The high-level tools produced by the project fall in two 
general categories of corpus-handling functions that are
basic across applications (these functions apply to mono-
lingual texts, multi-lingual parallel texts, and speech):
• Corpus annotation tools:
• segmenter: marks sentences, quotations, words,
abbreviations, names, terms, etc.;
• morphological analyser: provides possible lemmas,
morphological features, and parts of speech;
• part of speech disambiguator: disambiguates part of 
speech where alternatives exist; 
• aligner: provides alignments of passages among 
parallel texts; 
• prosody tagger: derives automatic modelling of F0 
curve and symbolic coding of intonation from the 
speech signal; 
• post-editing tools: assist in hand validation of 
automatically annotated corpora. 
• Corpus exploitation tools: 
• indexing tools: construct indexes for fast access to 
data; 
• search and retrieval tools: browsing, concordancing, 
retrieval of collocations, etc., based on a given 
word, words, pattern, syntactic category, etc.; 
• statistical and quantitative tools: generate lists and 
statistics--basic statistics for words, collocates 
(pattern or part of speech) such as frequency, 
mutual information, etc. Also word lists, lists by 
syntactic category, etc. 
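As an illustration of the statistical tools, the sketch below computes word frequencies and a pointwise mutual information score, log2(P(x,y) / (P(x) P(y))), for adjacent word pairs. The estimation choices are assumptions: probabilities are plain relative frequencies, pairs are restricted to adjacent tokens, and no smoothing or frequency threshold is applied. The function name `mutual_information` is hypothetical.

```python
import math
from collections import Counter

def mutual_information(tokens):
    """Pointwise mutual information (in bits) for each adjacent word pair."""
    unigrams = Counter(tokens)                  # word frequencies
    bigrams = Counter(zip(tokens, tokens[1:]))  # adjacent-pair frequencies
    n_tokens = len(tokens)
    n_pairs = sum(bigrams.values())
    scores = {}
    for (x, y), pair_count in bigrams.items():
        p_xy = pair_count / n_pairs
        p_x = unigrams[x] / n_tokens
        p_y = unigrams[y] / n_tokens
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

scores = mutual_information("strong tea and strong coffee".split())
```

Scores from very small samples such as this one are not meaningful; in practice such a measure is applied to large corpora and low-frequency pairs are usually filtered out.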
To provide support for these tools, several other general 
utilities will be required, such as general data 
manipulation tools, UNIX shell tools, etc. In addition, the
tools will be integrated by means of a common user
interface into a general-purpose corpus manipulation
system suitable for NLP and MT research. 
3.3. Markup Standard 
One of the goals of MULTEXT is to develop standards 
for encoding text corpora. 
We distinguish four levels of document markup: 
• Level 0. Document-wide markup:
• bibliographic description of the document, etc.
• character sets and entities
• description of encoding conventions
• Level 1. Gross structural markup:
• structural units of text, such as volume, chapter,
etc., down to the level of paragraph
• footnotes, titles, headings, tables, figures, etc.
• Level 2. Markup for sub-paragraph structures:
• sentences, quotations
• words
• abbreviations, names, dates, terms, cited words, etc.
• Level 3. Markup for linguistic annotation:
• morphological information
• syntactic information--e.g., part of speech 
• alignment of parallel texts 
• prosody 
Level 0 provides global information about the text, its
content, and its encoding. Level 1 includes universal text
elements down to the level of paragraph, which is the
smallest unit that can be identified language-
independently. Level 2 explicitly marks sub-paragraph
structures which are usually signalled (sometimes
ambiguously) by typography in the text and which are
language dependent. Level 3 enriches the text with the
results of some linguistic analyses.
The TEI guidelines [Sper94] provide the basis for
MULTEXT corpus markup for levels 0 (the TEI header),
1 and 2 as well as many elements of level 3. However,
the TEI standard will need careful examination and
adaptation [IdeV93b]:
(1) the TEI scheme is intended to be maximally
applicable to a variety of encoding purposes and
applications. Therefore it in many cases specifies several
encoding options for the same phenomena, and provides
options and elements without the specific needs of
corpus markup in mind.
(2) the TEI scheme is not complete; many areas are yet
to be addressed. For example, no TEI encoding scheme
for some aspects of spoken materials, such as prosody
(F0 modelling, symbolic coding, etc.), exists.
(3) the TEI scheme is largely untested on corpora,
especially multi-lingual corpora. Therefore, use of the
TEI scheme for corpus encoding will almost certainly
require modification and extension. For instance, TEI
mechanisms for alignment will require extension and/or
modification to handle multi-lingual text alignment and
alignment of different levels of speech representation
(signal, orthographic transcription, phonemic
transcription, prosody).
(4) the TEI scheme specifically does not aim to provide
recommendations for certain content-related elements.
For example, while the TEI provides several means to
mark POS, it is not within the scope of the TEI to
provide a standardized set of POS category names.
Instead, it provides a flexible mechanism that can
accommodate any set of actual tag names. Similarly, the
TEI does not provide guidelines for names which might,
for example, be used as identifiers for texts, text
categories, etc.
MULTEXT will use the TEI scheme as the basis for the
development of a TEI-conformant Corpus Encoding
Style (CES) that is optimally suited to NLP research and
can therefore serve as a widely accepted TEI-based style
for European corpus work.
3.4. Corpus 
The goal of MULTEXT is not to duplicate the various
large multi-lingual data gathering initiatives by collecting
raw data. The intent of the project is to provide a
valuable resource that is not provided elsewhere, in the
form of a high quality multi-lingual corpus for six
European languages, annotated for basic structural
features as well as sub-paragraph segmentation, POS,
and alignment in parallel texts.
The primary goal of the MULTEXT corpus is to provide
an example and testbed for:
(1) multi-lingual tools (especially engine-based tools,
alignment software, and multi-lingual extraction tools);
and
(2) markup across a large variety of languages (including
TEI text markup and the NERC pan-European part-of-
speech tagset [Mona92]).
MULTEXT has a secondary but important goal to
provide a corpus of value for general linguistic analytic
purposes, and will aim to serve this goal to the extent
possible without compromising or complicating the
primary goal.
The corpus will aim for three parts, each comprising six
languages (English, French, German, Italian, Spanish,
Dutch):
(1) a comparable corpus, consisting of 2M words per
language, composed of comparable types of texts from
two or three different domains. Ten percent of the corpus
for each language will be marked and hand-validated for
sub-paragraph segmentation and POS.
(2) a parallel corpus, composed of fully parallel texts
across the six languages and including 2M words per
language. Half of the corpus for each language will be
marked and hand-validated for sentence alignment. Ten
percent of the corpus for each language will be marked
and hand-validated for sub-paragraph segmentation and
POS.
(3) a small speech corpus, consisting of additional
markup to be used in conjunction with the EUROM-1
speech database. There is movement towards the
integration of NLP and Speech (see, for example,
ELSNET); MULTEXT will explore the possibilities for
such integration by attempting to harmonize tools and
methods from both areas. MULTEXT will pay special
attention to phenomena in the intersection of the two
domains, in particular prosody, whose supra-segmental
nature invites research into the complex relationships it
holds with morphology and syntax.
To serve its goals, MULTEXT will aim to construct its
corpus according to the following principles:
• Principle 1: Consistency
The same six languages will be represented in equal
amounts in all parts of the corpus. Similarly, equal
amounts of the same types of texts will be provided for
each language.
• Principle 2: Variety rather than representativeness
The MULTEXT corpus is small-scale compared to
national efforts aimed at providing balanced,
representative corpora in a single language. The project
does not therefore aim at representativeness or balance in
constructing its corpus. Instead, the MULTEXT corpus
will contain a variety of texts of different types and from
different domains, generally following (where
appropriate) known criteria from corpus linguistics.
• Principle 3: High quality of markup
In the state of the art, automatic markup of segmentation,
POS, and alignment is about 90-96% correct for English
(and French in the case of the Hansard). In order to
provide a reference corpus for further testing of
methodologies and tools, MULTEXT will hand-validate
a portion of its corpus to make it virtually error-free.
• Principle 4: Reuse of available data
MULTEXT is not committed to the goal of collecting
data, but rather to enhancing with structural and
linguistic annotation data which may be available from
other sources. The project therefore aims to use existing,
clean data to the extent possible, in order to avoid the
overhead of the acquisition process.
• Principle 5: Commitment to standards
MULTEXT will use, build upon, and contribute to
standards for text markup, including those of the TEI as
well as the EAGLES pan-European POS tagset. Because 
neither of these schemes has been widely tested, the
MULTEXT corpus will provide both a testbed and a 
basis for their evaluation and modification or extension. 
4. Exploitation and Future Prospects 
It is expected that the availability of basic multi-lingual 
tools and data will improve and extend R&D across a 
wide range of disciplines, including not only the various 
areas of NLP (language understanding and generation, 
translation, etc.), but also fields such as speech 
technology, language learning, lexicography and 
lexicology, literary and linguistic computing, information 
retrieval, etc. By feeding the results into several 
commercial applications systems/prototypes, the project 
is expected to show the potential of state-of-the-art 
methods in corpus linguistics for improving industrially 
relevant language systems and services. 
References 
[IdeV93a] Ide, N., Véronis, J. (1993). What next after
the Text Encoding Initiative? The need for text software.
ACH Newsletter, Winter 1993, 1-12.
[Libe92] Liberman, M., Marcus, M. (1992). Tutorial
on Text Corpora, Association for Computational
Linguistics Annual Conference.
[Mona92] Monachini, M., Ostling, A. (1992). Towards
a Minimal Standard for Morphosyntactic Corpus
Annotation, Report of the Network of European
Reference Corpora, Workpackage 8.2.
[Chur88] Church, K. W. (1988). A stochastic parts
program and noun phrase parser for unrestricted texts. In
Proceedings of the Second Conference on Applied
Natural Language Processing, Austin, Texas, 136-143.
[Cutt92] Cutting, D., Kupiec, J., Pedersen, J., Sibun,
P. (1992). A Practical Part of Speech Tagger,
Proceedings of the Third International Conference on
Applied Natural Language Processing, Trento, 133-140.
[Gale91] Gale, W., Church, K. W. (1991). A Program
for Aligning Sentences in Bilingual Corpora,
Proceedings of the ACL Conference, Berkeley, 177-184.
[Hirst93] Hirst, D., Espesser, R. (1993). Automatic
modelling of fundamental frequency. Travaux de
l'Institut de Phonétique d'Aix, 15, 71-85.
[Hirst91] Hirst, D., Nicolas, P., Espesser, R. (1991).
Coding the F0 of a continuous text in French: an
Experimental Approach. 12ème Congrès International
des Sciences Phonétiques, Aix-en-Provence, 5, 234-237.
[IdeV93b] Ide, N., Véronis, J. (1993). Background and
context for the development of a Corpus Encoding
Standard, EAGLES Working Paper, 30p.
[Sper94] Sperberg-McQueen, C. M., Burnard, L.
(1994). Guidelines for Electronic Text Encoding and
Interchange, Text Encoding Initiative, Chicago and
Oxford (in press).
Appendix - Descriptive overview 
MULTEXT (Multilingual Text Tools and Corpora) 
Coordinator 
Dr. Jean Véronis
Laboratoire Parole et Langage
CNRS & Université de Provence
29, Avenue Robert Schuman
F-13621 Aix-en-Provence Cedex 1
tel: +33 42 95 20 73
fax: +33 42 59 50 96
e-mail: veronis@fraixll.univ-aix.fr
Start Date               Jan. 1994
Duration                 26 months
Resources                238.5 person-months
Estimated total cost     3.210.000 ECU
Partners                                      Country
CNRS                                          FR
EUROLANG-SITE                                 FR
INCYTA                                        ES
Digital Equipment B.V.                        NL
CAP debis Systemhaus KSP                      DE
University of Pisa (ILC/CNR)                  IT
University of Edinburgh (HCRC/LTG)            UK
ISSCO                                         CH
Associated Partners                           Country
Siemens Nixdorf Informationssysteme AG        DE
Universitaet Muenster                         DE
Rank Xerox Research Center                    FR
Universitat Autonoma de Barcelona             ES
Universitat Central de Barcelona (FBG)        ES
Universiteit Utrecht                          NL
