INTEX: A CORPUS PROCESSING SYSTEM
Max D. Silberztein
Laboratoire d'Automatique Documentaire et Linguistique
Université Paris 7
ABSTRACT 
INTEX is a text processor; it is usually used to 
parse corpora of several megabytes. It includes 
several built-in large coverage dictionaries and 
grammars represented by graphs; the user may
add his/her own dictionaries and grammars.
These tools are applied to texts in order to locate
lexical and syntactic patterns, remove ambiguities,
and tag words. INTEX builds concordances
and indexes of all types of patterns; it is used by
linguists to analyse corpora, but can also be
viewed as an information retrieval system.
INTRODUCTION
INTEX automatically identifies words and morpho-syntactic
patterns in large texts. By using
INTEX, one can: 
-- build the dictionary of the words of the texts;
words may be simple words (sequences of letters,
e.g. table), compounds (sequences of simple
words which include a separator, e.g. word processor)
or complete expressions (sequences of
words which accept insertions, e.g. to kick ... the
bucket);
-- locate in texts all occurrences of a given word
(even if inflected), a given category (e.g. all feminine
plural adjectives) or a morpho-syntactic pattern
(a regular expression);
-- apply grammars represented by recursive
graphs to texts; build indexes or concordances for
all occurrences of the previous patterns;
-- use local grammars to remove word and utterance
ambiguities in texts, or to detect errors or
deviant sequences.
While INTEX already includes several built-in
dictionaries and grammars, it allows the user to
create, edit and add his/her own tools, in order to
increase coverage of texts and to remove additional
ambiguities.
1. LINGUISTIC TOOLS
The user first loads a text and selects the working
language 1. INTEX counts the number of tokens
in the text, the number of different ones, and sorts
them by frequency. Then the user selects linguistic
tools to parse the text. Tools are either dictionaries
or finite state transducers (FSTs).
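The counting step described above can be illustrated with a minimal Python sketch. It is not INTEX code: the function name `token_statistics` is hypothetical, and the tokenization rule (a token is a maximal run of letters, lowercased) is an assumption, since the tokenization used by INTEX is not specified here.

```python
import re
from collections import Counter

def token_statistics(text):
    """Return (total tokens, distinct tokens, tokens ranked by frequency)."""
    # Assumption: a token is a maximal run of letters, compared in lowercase.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(tokens)
    # Most frequent first; ties are broken alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return len(tokens), len(counts), ranked

total, distinct, ranked = token_statistics("Il la donne. Elle la donne à Luc.")
```

Here `total` counts every token occurrence, `distinct` counts the different tokens, and `ranked` lists each token with its frequency.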
1.1. Dictionaries 
INTEX is based on two large coverage built-in
dictionaries:
--the DELAF dictionary contains over 700,000
simple words, basically all the simple words of
the language 2. Each entry in the DELAF is associated
with explicit morphological information
for each word: its canonical form (e.g. the infinitive
for verbs), its part of speech (e.g. Noun), and
some inflectional information (e.g. first person
singular present). Here are three entries of the
French DELAF:
a, avoir. V:P3s
abacas, abaca. N:mp
abaissa, abaisser. V:J3s
The token 'a' is the Verb 'avoir' conjugated in the
Third Person Singular Present (P3s); 'abacas' is
the masculine plural of the Noun 'abaca';
'abaissa' is a verbal form of 'abaisser' conjugated
in the third person singular "Passé simple"
(J3s). Since the morphological analysis of each
1. At this moment, English, French and Italian dictionaries have
already been included in INTEX. German, Spanish and Portuguese
compatible dictionaries are under construction. We will
give French examples.
2. For a discussion on the completeness of the DELAF dictionary,
see in [Courtois; Silberztein 1989], [Clemenceau 1993].
token is performed by a simple lookup routine, 
INTEX guarantees an error free result (there is no 
guessing algorithm nor 'probabilistic' result). 
INTEX includes a few other dictionaries for 
proper names, toponyms, acronyms, etc.; 
--the DELACF dictionary contains over 
150,000 compounds, mostly nouns 3. Each entry 
in the DELACF is associated with its canonical 
form, its part of speech, and some inflectional 
information. Here are three entries of the French 
DELACF: 
à tout de suite, à tout de suite. ADV
cartes bleues, carte bleue. N:fp 
pomme de terre, pomme de terre. N:fs 
INTEX includes a few other dictionaries for compound
proper names. The user may add his/her
own dictionaries for simple words and com- 
pounds. 
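The entry format shown above (inflected form, canonical form, part of speech, optional inflectional codes) lends itself to a simple lookup table. The following Python sketch parses such entries; the function names are hypothetical, and the parsing rules are inferred only from the six sample entries, so they may not cover the full DELAF and DELACF conventions.

```python
from collections import defaultdict

def parse_entry(line):
    """Split 'form, canonical form. CATEGORY:features' into four parts."""
    form, rest = line.split(",", 1)
    lemma, tag = rest.rsplit(".", 1)
    category, _, features = tag.strip().partition(":")
    return form.strip(), lemma.strip(), category, features

def load_dictionary(lines):
    """Index entries by inflected form; one form may have several analyses."""
    entries = defaultdict(list)
    for line in lines:
        form, lemma, category, features = parse_entry(line)
        entries[form].append((lemma, category, features))
    return entries

DELAF_SAMPLE = ["a, avoir. V:P3s", "abacas, abaca. N:mp", "abaissa, abaisser. V:J3s"]
DELACF_SAMPLE = [
    "à tout de suite, à tout de suite. ADV",
    "cartes bleues, carte bleue. N:fp",
    "pomme de terre, pomme de terre. N:fs",
]

simple_words = load_dictionary(DELAF_SAMPLE)
compounds = load_dictionary(DELACF_SAMPLE)
```

A lookup is then a plain table access, for example `simple_words.get("abacas", [])`, which is consistent with the statement that no guessing is involved: a form is either listed or unknown.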
1.2. Finite State Transducers 
FSTs are represented in INTEX by recursive 
graphs. Basically, the "input" part of an FST is 
used to identify patterns in texts; the "output" 
part of an FST is used to associate each identified 
occurrence with information. In many cases, 
FSTs represent words more naturally than dictionaries.
For example, numerical determiners, such
as trente-cinq mille neuf cents trente-quatre, formally
are compounds which are naturally represented
by graphs (see the graph Dnum in
Appendix 1). FSTs may also be used to bring
together graphical variants of a word in order to
check the spelling coherency, to associate all the
variants of a term with a unique canonical entry
in an index, to represent families of derived 
words (see the graph France in Appendix 1), to 
associate synonyms of a term in an information 
retrieval system, etc. In the graph editor, gray 
nodes are graph names; tags written in white 
nodes are the inputs of the FSTs, outputs are writ- 
ten below nodes 4. The user draws graphs directly 
3. For a discussion on the completeness of the DELACF, see in
[Courtois; Silberztein 1989].
4. For a description of the graph editor of INTEX, see [Silberztein
1993].
on the screen; the resulting graphs are interpreted
as FSTs by INTEX. 
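To make the input/output distinction concrete, here is a minimal Python sketch of a transducer that maps graphical variants of a term to one canonical index entry. Everything in it is an assumption made for illustration: the data layout, the function names, and the example term (week end, week-end, weekend) are not taken from INTEX, and the sketch handles only flat, deterministic graphs, not recursive ones.

```python
# Hypothetical transducer: state -> {input token: (output, next state)}.
VARIANTS_FST = {
    "start": 0,
    "final": {3},
    "transitions": {
        0: {"week": ("", 1), "weekend": ("week-end", 3)},
        1: {"-": ("", 2), "end": ("week-end", 3)},
        2: {"end": ("week-end", 3)},
    },
}

def match_at(fst, tokens, position):
    """Return (end position, output) of the longest match, or None."""
    state, outputs, best = fst["start"], [], None
    for index in range(position, len(tokens)):
        arc = fst["transitions"].get(state, {}).get(tokens[index])
        if arc is None:
            break
        output, state = arc
        if output:
            outputs.append(output)
        if state in fst["final"]:
            best = (index + 1, " ".join(outputs))
    return best

def index_canonical(fst, tokens):
    """List (position, canonical entry) for every occurrence in the text."""
    hits, position = [], 0
    while position < len(tokens):
        found = match_at(fst, tokens, position)
        if found:
            hits.append((position, found[1]))
            position = found[0]
        else:
            position += 1
    return hits

hits = index_canonical(VARIANTS_FST, "le week end et le weekend".split())
```

The input side of each arc decides whether the text matches; the output side supplies the information attached to the occurrence, here the canonical spelling under which it would be indexed.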
By selecting and applying dictionaries and FSTs 
to a text, the user builds the dictionary of the 
words of the text. Appendix 1 shows the resulting 
dictionary, as well as the list of all unknown 
tokens. Generally, these tokens are either spelling 
errors or proper names. 
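The construction of the dictionary of the text can be sketched as follows, under simplifying assumptions: only simple words are handled, dictionaries are plain Python mappings from a token to its analyses, and the names `dictionary_of_text` and `SAMPLE` are hypothetical.

```python
def dictionary_of_text(tokens, dictionaries):
    """Split the distinct tokens of a text into known and unknown ones."""
    known, unknown = {}, []
    for token in dict.fromkeys(tokens):  # distinct tokens, first-seen order
        analyses = [a for d in dictionaries for a in d.get(token, [])]
        if analyses:
            known[token] = analyses
        else:
            unknown.append(token)
    return known, unknown

# Illustrative data only: two analyses copied from the DELAF sample entries.
SAMPLE = {"a": [("avoir", "V:P3s")], "abacas": [("abaca", "N:mp")]}
known, unknown = dictionary_of_text(["a", "abacas", "Xyz", "a"], [SAMPLE])
```

The `unknown` list plays the role of the list of unknown tokens, which the user would then inspect for spelling errors and proper names.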
2. LOCATING PATTERNS 
After having built the dictionary of the words of 
the text, the user can locate morpho-syntactic patterns
in the corpus, index or build a concordance
for all occurrences of the pattern. Patterns may
be: 
--a word, or a list of words. For example, one 
can locate in a text all occurrences of the verb 
faire (even when inflected), all the compound
nouns (since most of them are non-ambiguous 
terms, their list constitutes a good index); 
--a given category, such as verb conjugated in 
the third person singular (V:3s), or noun in the
feminine plural (N:fp), etc. Here are several
examples of categories 5:
A:p (adjective in plural), ADV (adverb),
DET:f (feminine determiner),
V:Kms (past participle, masculine singular),
etc.
--a syntactic pattern represented by a regular 
expression or a graph; the following is a regular 
expression: 
<être> (<ADV> + <E>) <DET> <N>
This pattern matches any sequence beginning
with a conjugated form of the verb être, optionally
followed by an adverb (<E> stands for the
null word), followed by a determiner and then a 
noun. Note that categories match simple and 
compound words. In particular, <ADV> also
matches compound adverbs. More generally, the
user may apply to the text grammars expressed
by recursive graphs; graphs typically represent: 
5. For a syntactic description of the categories, see [Silberztein
1993].
-- sets of synonymous expressions, such as: perdre
la tête, l'esprit, le nord, etc. Graphs in different
languages can be linked, so that each
matching sequence in the source language could
be automatically associated with the corresponding
graph in the target language (e.g. lose one's
head, mind, bearings, etc.). A graph may repre- 
sent all the expressions which designate an entity, 
or a process; indexing such graphs allows one to 
retrieve information in large corpora; 
-- pieces of a large-coverage grammar of the language.
Recursive graphs are easily edited; standard
operations on graphs (union, intersection,
differences, etc.) help to build an easily maintained
system of hundreds of elementary graphs.
This construction has begun in LADL; we 
already have graphs describing adverbial comple- 
ments which express a measure (temperature, 
speed, length, etc.), a time or a date (e.g. le 17 
février 1993, le premier lundi du mois de juin)
(Maurel 1989), some locative structures (Garri- 
gues 1993), etc. 
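The regular expression <être> (<ADV> + <E>) <DET> <N> discussed in this section can be matched against a tagged text with a small recursive matcher. The Python sketch below is illustrative only: the pattern encoding, the function names and the tagged sample sentence are assumptions, and the sketch covers neither compounds nor recursive graphs.

```python
# Each token carries its possible (canonical form, category) analyses.
TEXT = [
    ("Luc", [("Luc", "N")]),
    ("est", [("être", "V"), ("est", "N")]),
    ("souvent", [("souvent", "ADV")]),
    ("le", [("le", "DET"), ("le", "PRO")]),
    ("chef", [("chef", "N")]),
]

# One list of alternatives per position; None plays the role of <E>.
PATTERN = [[("lemma", "être")], [("cat", "ADV"), None], [("cat", "DET")], [("cat", "N")]]

def symbol_matches(symbol, analyses):
    kind, value = symbol
    index = 0 if kind == "lemma" else 1
    return any(analysis[index] == value for analysis in analyses)

def match_from(pattern, text, position):
    """Return the end position of a match starting at `position`, or None."""
    if not pattern:
        return position
    for symbol in pattern[0]:
        if symbol is None:  # the null word consumes no token
            end = match_from(pattern[1:], text, position)
        elif position < len(text) and symbol_matches(symbol, text[position][1]):
            end = match_from(pattern[1:], text, position + 1)
        else:
            end = None
        if end is not None:
            return end
    return None

def concordance(pattern, text):
    """Return (left context, match, right context) for each occurrence."""
    words = [word for word, _ in text]
    lines = []
    for start in range(len(text)):
        end = match_from(pattern, text, start)
        if end is not None and end > start:
            lines.append((" ".join(words[:start]),
                          " ".join(words[start:end]),
                          " ".join(words[end:])))
    return lines

lines = concordance(PATTERN, TEXT)
```

Because a token matches as soon as one of its analyses fits, the ambiguous tokens est and le are both accepted here, which mirrors the fact that patterns are located before ambiguities are removed.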
3. REMOVING AMBIGUITIES 
In order to disambiguate words in texts, INTEX 
uses cache dictionaries and local grammars. 
3.1. Cache dictionaries 
Since the DELAF and DELACF dictionaries 
included in INTEX have a very large coverage, 
they contain a number of words which only occur 
in some specific domains; in addition, some frequent
words may be associated with generally
inappropriate information. For instance, par is
usually a preposition in French, but in some cases
it may be a noun (a technical term in golf). By
default, each occurrence of this token will be considered
ambiguous (preposition or noun). Cache
dictionaries are used as filters: if INTEX finds a
word in a cache dictionary, it will not look up the
selected dictionaries and FSTs. If the user knows
that in a given corpus, the token par is always a
preposition, he/she enters the following entry in a
cache dictionary:
par, par. PREP
Hence, the user can avoid unnecessary ambiguities
by putting frequent words (or conversely,
specific terms) in cache dictionaries adapted to
each processed text. 
Most compounds are ambiguous, since they formally
are sequences of simple words; for
instance, the sequence pomme de terre is not necessarily
a compound noun in the following sentence:
Luc recouvre une pomme de terre cuite
(Luc covers a cooked potato)
(Luc covers an apple with scorched earth)
However, a number of compounds are not ambiguous,
either because they contain a non-autonomous
constituent (e.g. aujourd'hui), or because
they are technical terms (e.g. un tube cathodique,
un sous-marin nucléaire). By entering these non-ambiguous
compounds in a cache dictionary, the
user prevents INTEX from looking up dictionaries
and FSTs for simple words; hence INTEX
does not process these compounds as ambiguous.
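The filtering behaviour of a cache dictionary can be summarized in a few lines of Python. This is a sketch under stated assumptions: dictionaries are modelled as plain mappings from a token to its analyses, and the names `lookup`, `MAIN` and `CACHE` are hypothetical.

```python
def lookup(token, cache, dictionaries):
    """Return the analyses of `token`, consulting the cache first."""
    if token in cache:
        # A cache hit hides every analysis stored in the other dictionaries.
        return list(cache[token])
    analyses = []
    for dictionary in dictionaries:
        analyses.extend(dictionary.get(token, []))
    return analyses

# Illustrative data: par as a preposition or as a noun (the golf term).
MAIN = {"par": [("par", "PREP"), ("par", "N")]}
CACHE = {"par": [("par", "PREP")]}

without_cache = lookup("par", {}, [MAIN])
with_cache = lookup("par", CACHE, [MAIN])
```

Without the cache the token keeps both analyses; with the cache entry only the preposition remains, so the ambiguity never enters the parse.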
3.2. Local grammars 
A local grammar is a two-part rule: if a given
sequence of words is matched, then each word in
the sequence is tagged in the proper way. For
instance, in the sequence s'en donne, s' is a pronoun
(not a conjunction), en is a pronoun (not a
preposition), and donne is a verb (not a noun).
The corresponding local grammar would be:
s'/<PRO> en/<PRO> <MOT>/<V>
<MOT> stands for any word. Local grammars are
represented by FSTs, hence their length and their
complexity have no limit. Any number of local
grammars may be used at the same time to disambiguate
texts (FSTs merge easily); hence it is best
to create small ones. Local grammars use the dictionary
of the words of the texts, so they correctly
handle sequences with compounds. Appendix 2
shows a few local grammars. INTEX includes a
dozen "perfect" local grammars, that is, grammars
that will never give incorrect tagging solutions;
the user may add his/her own perfect (or
probabilistic) disambiguating grammars.
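As an illustration of the two-part rule, the Python sketch below applies the local grammar s'/<PRO> en/<PRO> <MOT>/<V> to a sequence of ambiguous tokens. It is a simplified stand-in, not the FST machinery of INTEX: the rule is a flat list, the function names are hypothetical, and the analyses given for s' and en are supplied for the example.

```python
# The rule as (input, output category) pairs; "<MOT>" accepts any word.
RULE = [("s'", "PRO"), ("en", "PRO"), ("<MOT>", "V")]

SENTENCE = [
    ("s'", [("se", "PRO"), ("si", "CONJ")]),
    ("en", [("en", "PRO"), ("en", "PREP")]),
    ("donne", [("donne", "N:fs"), ("donner", "V:P1s"), ("donner", "V:P3s"),
               ("donner", "V:S1s"), ("donner", "V:S3s"), ("donner", "V:Y2s")]),
]

def category(tag):
    return tag.split(":")[0]

def apply_rule(rule, sentence):
    """Keep only the analyses licensed by the rule wherever it matches."""
    result = [(word, list(analyses)) for word, analyses in sentence]
    for start in range(len(result) - len(rule) + 1):
        kept = []
        for (expected, cat), (word, analyses) in zip(rule, result[start:]):
            if expected != "<MOT>" and expected != word:
                break
            matching = [a for a in analyses if category(a[1]) == cat]
            if not matching:
                break
            kept.append((word, matching))
        else:  # every position of the rule matched
            result[start:start + len(rule)] = kept
    return result

tagged = apply_rule(RULE, SENTENCE)
```

The rule only removes analyses; when any position fails to match, the sequence is left untouched, which is the property a "perfect" grammar relies on.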
3.3. The result of the parsing
After having selected linguistic tools (either dictionaries
or FSTs), the user can parse the text,
that is, insert in the text all the linguistic information
required by a syntactic parser. For instance,
the text: il la donne would at this step be represented
by the following expression:
il, PRO
(la, PRO:fs + la, DET:fs)
(donne, N:fs + donner, V:P1s + donner, V:P3s +
donner, V:S1s + donner, V:S3s + donner, V:Y2s)
la can be a pronoun or a determiner; donne is a 
noun, or 5 conjugated forms of the verb donner. 
INTEX then builds the corresponding minimal 
automaton: the number of transitions of this 
automaton corresponds to the number of lexical 
ambiguities of the text (in the above example: 9 
transitions). By selecting and applying local 
grammars to the text, the user effectively removes 
transitions in the resulting automaton. For 
instance, thanks to a simple local grammar 
(which describes the preverbal particles), the 
above text can be parsed to give the following 
expression: 
il, PRO
la, PRO:fs
(donner, V:P3s + donner, V:S3s)
The remaining ambiguity corresponds to the 
tense of the verb: indicative or subjunctive 
present. The corresponding automaton has only 4 
transitions. Hence, the number of transitions can 
be used as a quantitative tool to measure the effi- 
ciency of the removal of ambiguities. By select- 
ing one local grammar at a time, or by merging 
several, the user is able to apprehend exactly how 
each grammar covers the text, and performs in
terms of deleting transitions. 
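The transition counts quoted above (9 before disambiguation, 4 after) can be reproduced with a short Python sketch. It assumes the simplest case, where the automaton of a sentence is linear and each analysis of a token is one transition; the function names and the `removal_rate` measure are hypothetical illustrations of using the count as a quantitative tool.

```python
BEFORE = [
    ["il,PRO"],
    ["la,PRO:fs", "la,DET:fs"],
    ["donne,N:fs", "donner,V:P1s", "donner,V:P3s",
     "donner,V:S1s", "donner,V:S3s", "donner,V:Y2s"],
]
AFTER = [["il,PRO"], ["la,PRO:fs"], ["donner,V:P3s", "donner,V:S3s"]]

def count_transitions(sentence):
    """One transition per distinct analysis of each token."""
    return sum(len(set(alternatives)) for alternatives in sentence)

def removal_rate(before, after):
    """Share of the transitions deleted by the local grammars."""
    initial = count_transitions(before)
    return (initial - count_transitions(after)) / initial

before_count = count_transitions(BEFORE)
after_count = count_transitions(AFTER)
rate = removal_rate(BEFORE, AFTER)
```

For this sentence 5 of the 9 transitions are deleted; comparing such rates grammar by grammar is one way to see how each grammar performs on a text.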
CONCLUSION 
INTEX is used for several purposes: 
--lexicographers who build dictionaries for 
compounds (or technical terms) try to find new 
ones by applying characteristic patterns to big 
corpora, such as: <N> (de + d' + de la + du + 
des) <N>; 
--linguists who study specific syntactic struc- 
tures use INTEX to find attestations of these 
structures. For instance, one may search for the 
following structure in order to find predicative 
nouns associated with the support verbs avoir, donner,
être, faire:
(<avoir> + <donner> + <être> + <faire>)
(<ADV> + <E>) <N>
--our objective is to build a large grammar 
which covers as much of the language as possi- 
ble. By applying "pieces" of grammar to big cor- 
pora, and then studying the outputs, one can 
correct and refine each piece, and incrementally 
develop the global grammar; 
--INTEX is used to find "semantic units" in 
large technical texts, hence it constitutes a good 
information retrieval system. 
REFERENCES 
Courtois B., Silberztein M. Eds. (1989). Les dictionnaires
électroniques. Langue française,
Larousse : Paris.
Clemenceau D. (1993). Structuration du lexique
et reconnaissance de mots dérivés. Thèse
de doctorat en informatique, LADL, Université
Paris 7 : Paris.
Garrigues M. (1993). Prépositions et noms de
pays et d'îles : une grammaire locale pour
l'analyse automatique des textes. In Lingvisticae
Investigationes XVII:2. John
Benjamins : Amsterdam.
Maurel D. (1989). Reconnaissance de séquences
de mots par automate. Thèse de doctorat en
informatique, LADL, Université Paris 7 :
Paris.
Silberztein M. (1993). Dictionnaires électroniques
et analyse automatique de textes.
Masson : Paris.
Appendix 1: Building the dictionary of the text 
[Figure not reproduced: INTEX screen captures showing the graphs cited in Section 1.2, the resulting dictionary of the text, and the list of unknown tokens.]
Appendix 2: Local grammars and the removal of ambiguities
[Figure not reproduced: local grammars represented as graphs.]
