TEXT DISAMBIGUATION BY FINITE STATE AUTOMATA, 
AN ALGORITHM AND EXPERIMENTS ON CORPORA 
Emmanuel Roche* 
Institut Gaspard Monge, CERIL-LADL** 
Paris France 
1. Abstract* 
Consulting a dictionary for the words of a 
given text provides multiple solutions, that is, 
ambiguities; thus, the sequence of words pilot 
studies could lead for example to: 
pilot: N singular, V infinitive, V (conjugated) 
studies: N plural, V (conjugated) 
pilot studies: N plural (compound). 
This information can be organized in the 
form of a finite automaton such as: 
[Figure: automaton for the sequence pilot studies, including the compound reading N plural] 
The exploration of the context should provide 
clues that eliminate the non-relevant solutions. 
For this purpose we use local grammar 
constraints represented by finite automata. We 
have designed and implemented an algorithm 
which performs this task by using a large 
variety of linguistic constraints. Both the texts 
and the rules (or constraints) are represented in 
the same formalism, that is finite automata. 
Performing subtraction operations between 
text automata and constraint automata reduces 
the ambiguities. Experiments were performed 
on French texts with large-scale dictionaries 
(one dictionary of 600,000 simple inflected 
forms and one dictionary of 150,000 inflected 
compounds). Syntactic patterns represented by 
automata, including shapes of compound 
nouns such as Noun followed by an Adjective 
(in gender-number agreement) (cf. 5.1), can be 
matched in texts. 
This process is thus an extension of the classic 
matching procedures because of the on-line 
dictionary consultation and because of the 
grammar constraints. It provides a simple and 
efficient indexing tool. 
2. Motivation 
* This work was supported by DRET and Ecole 
Polytechnique. 
** Université Marne-la-Vallée. Institut Gaspard Monge. 2 
Allée Jean Renoir. 93160 Noisy-le-Grand. France. 
eroche@ladl.jussieu.fr Université Paris 7 
Automatic analysis by phrase-structure 
grammar is time-consuming. The need for fast 
procedures leads to grammar representations 
that are less powerful but easier to handle than 
general unification procedures. Pereira and 
Wright 1991 and Rimon and Herz 1991 
proposed such approaches, that is, algorithms 
that construct a finite-state automaton 
approximation of a phrase-structure 
grammar. These automata are then used as 
simple checkers of well-formed patterns. 
However, parsing a sentence and only 
reporting whether or not it matches the 
automaton description is not sufficient. One 
should provide (see K. Koskenniemi 1990) the 
readings of the text that respect exactly the 
constraints. We propose here an algorithm that 
provides all these readings. Moreover, the 
automaton of a given text can be highly 
ambiguous, and in order to increase its 
adequacy (e.g. to study given syntactic 
patterns), we may want to customize it. To 
achieve such a result, we construct automata 
that eliminate paths irrelevant to the given 
study 1. Once this operation is performed, 
significant patterns (like Noun Adjective in 
French) can be extracted. 
Technical terms in many domains take the 
form of sequences such as Noun Adjective, 
Noun de Noun, etc. Their recognition thus 
leads to an efficient indexing process. This is 
a complementary approach to statistical 
treatments like those presented in K. Church, 
W. Gale, P. Hanks, D. Hindle 1989 or in N. 
Calzolari and R. Bindi 1990. 
Moreover, we use Finite-State Automata 
(FSA) at all stages of the process: for 
dictionary consultation, for disambiguation 
and for the final extraction process. This 
allowed the experiments to be done on-line 
starting with untagged corpora. 
One of the crucial points is that tagged text 
should be represented by FSA in order to be 
disambiguated (disambiguated texts are already 
in this form in Rimon and Herz 1991 and K. 
Koskenniemi 1990). Representing ambiguities 
by FSA is not a new approach, but in our 
contribution we 
1 Some of these paths may correspond to legitimate 
solutions. 
ACTES DE COLING-92, NANTES, 23-28 AOÛT 1992 993 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992 
systematized it for different types of 
ambiguities, namely: 
1. Morphological feature ambiguity (gender, 
for instance), 
2. Part-of-speech ambiguity, 
3. Phrase ambiguity (compound vs. sequence 
of simple words). 
3. Presentation of an example 
Let us take, for instance, the French sequence 
(1) ... le passe ... 
Both words are ambiguous: le can be either an 
article (the) or a pronoun (it, him or her) and 
passe can be either a noun or a verb. Moreover, 
the noun passe is itself ambiguous, since it can 
mean either a pass key (and is then masculine) 
or a pass (as in a forward pass; it is then 
feminine). The verb form passe, in turn, is 
ambiguous: it is a conjugated form of the 
canonical form passer (to pass) in one of 
three tenses: indicative present, subjunctive 
present or imperative present. For the first two 
tenses, it can be in either the first or the third 
person singular and, for the last one, it has to be 
in the second person singular. 
The problem is the following: the consultation 
of the simple form dictionary DELAF 2 
(600,000 entries) first provides a sequence 
tagged as follows: 
le (pronoun, article) passe (noun-ms, noun-fs, 
verb-P3s:S3s:P1s:S1s:Y2s) 
(where the abbreviations are m: masculine, f: feminine, 
s: singular, 1, 2, 3: first, second, third person, 
P: present, S: subjunctive, Y: imperative) 
The compound form dictionary DELACF 3 
(150,000 entries) is then used; it marks sequences 
like pomme de terre (potato) as frozen. In a 
second step, we provide the automaton 
representation of figure 1, to be read in the 
following way: the first word is either a 
pronoun or an article, and its spelling is le; the 
second word is either a singular noun 
(ambiguous: one meaning is masculine (the 
pass key) and the other feminine (the forward 
pass)) or else a verb conjugated in the 
persons, tenses and numbers specified above. 
2 DELAF: LADL's inflected forms dictionary for simple 
words: B. Courtois 1984, 1989. 
3 DELACF: LADL's inflected forms dictionary for 
compound words: M. Silberztein 1989. 
[Automaton for le passe: le is a pronoun or an article; passe is a noun or a verb] 
Figure 1 
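As a rough illustration of this representation, the ambiguous sequence can be stored as a small acyclic automaton whose transitions carry (word, tag) labels. The sketch below is written in Python for readability; the state numbers, the tag strings and the helper name `readings` are assumptions made for the example, and the verb forms are collapsed into a single "verb" tag.

```python
# Hypothetical encoding of the text automaton for "le passe":
# states are integers, each transition carries a (word, tag) label.
TEXT_FSA = {
    0: [(("le", "pronoun"), 1), (("le", "article"), 1)],
    1: [(("passe", "noun-ms"), 2),
        (("passe", "noun-fs"), 2),
        (("passe", "verb"), 2)],
    2: [],
}
FINAL = {2}

def readings(fsa, final, state=0, prefix=()):
    """Enumerate every path from `state` to a final state (the automaton is a DAG)."""
    if state in final:
        yield prefix
    for label, target in fsa[state]:
        yield from readings(fsa, final, target, prefix + (label,))

all_readings = list(readings(TEXT_FSA, FINAL))
print(len(all_readings))  # 2 tags for "le" times 3 tags for "passe"
```

Each path through the automaton is one candidate reading of the sequence; the constraints are meant to remove some of these paths.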
On the other hand, grammar rules provide 
constraints which can be described as 
forbidden sequences. In our example, since 
the clitic sequence is highly constrained (M. 
Gross 1968), the pronoun le can only be followed 
by another pronoun or by a verb. The 
article le cannot be followed by a verb or by a 
feminine noun (except for parts of 
compounds). This set of forbidden sequences 
is described by the automaton of figure 2. 
Figure 2. 
Thus the FSA representing the text according 
to the rules should be the FSA of figure 3. 
Figure 3. 
The problem consists in constructing the 
automaton of figure 3 given those of figures 1 
and 2. 
The reader has probably noticed that the rules were 
described as a set of forbidden sequences, 
which is unusual. The formal operation and 
the algorithm are easier to describe with 
negatively defined rules, which is why 
we use this device here. However, given the 
grammar corresponding to the automaton 
representation, the procedure is equivalent to a 
set of rules expressed in a positive, and hence 
more usual, way. 
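The intended effect of the forbidden sequences on the example can be checked by brute force. The following Python sketch enumerates the tag sequences of le passe and drops those containing a forbidden factor; the tag names and the set `FORBIDDEN` are our transcription of the rules stated above (with the verb forms collapsed into one tag), not the contents of figure 2. Enumerating paths is exponential in general and is only meant to show the expected result, not how the filtering is actually computed.

```python
from itertools import product

# Tags per word, with the verb forms collapsed into one "verb" tag (a simplification).
LE = ["pronoun", "article"]
PASSE = ["noun-ms", "noun-fs", "verb"]

# Forbidden tag sequences, transcribed from the rules stated in the text.
FORBIDDEN = {
    ("article", "verb"),
    ("article", "noun-fs"),
    ("pronoun", "noun-ms"),
    ("pronoun", "noun-fs"),
}

def contains_forbidden(tags, forbidden):
    """True if some contiguous sub-sequence (factor) of `tags` is forbidden."""
    n = len(tags)
    return any(tuple(tags[i:j]) in forbidden
               for i in range(n) for j in range(i + 1, n + 1))

kept = [tags for tags in product(LE, PASSE)
        if not contains_forbidden(tags, FORBIDDEN)]
print(kept)  # [('pronoun', 'verb'), ('article', 'noun-ms')]
```

Under these assumptions only two readings survive: the pronoun followed by the verb, and the article followed by the masculine noun.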
4. The algorithm 
4.1 Formal description of the problem. 
The problem, informally described, can easily 
be specified in the following way: 
Given a text, its FSA representation (e.g. 
figure 1) A1 is defined by the 5-tuple 
(Alph, Q1, i1, F1, d1), which respectively denotes 
its alphabet, its state set, its starting state, its 
final state set and its transition function 4, which 
maps (Q1 × Alph) into Q1. Moreover, A1 has 
the property of being acyclic (it is a Directed 
Acyclic Graph (DAG)). The constraints are 
represented by the FSA A2, defined in the 
same way by (Alph, Q2, i2, f2, d2). These 
automata define respectively the regular 
languages L1 = L(A1) (i.e. the language 
accepted by A1) and L2 = L(A2) (i.e. the 
language accepted by A2). Since L2 describes 
the set of sequences (or factors) forbidden in 
any word of L1, if A describes the text after 
the filtering, this means that L = L(A) satisfies 
the condition 
L = L1 \ (Alph* L2 Alph*) 
This operation on languages will be called 
factor subtraction and will be noted L = L1 
f- L2. At this point, we can define the related 
operation on automata: if L1 = L(A1) and 
L2 = L(A2), we say that A is the factor 
subtraction of A1 and A2, and note it A = A1 f- 
A2, if L = L1 f- L2 and L = L(A). 
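To make the definition concrete, here is a minimal Python sketch of factor subtraction on finite languages given as sets of strings. The function name `factor_subtract` and the two small languages are hypothetical; the sketch works on explicit word lists, whereas the operation defined above is meant to be carried out on automata.

```python
def factor_subtract(l1, l2):
    """Return the words of l1 that contain no word of l2 as a contiguous factor."""
    return {w for w in l1 if not any(f in w for f in l2)}

# Hypothetical languages over the alphabet {a, b}.
L1 = {"ab", "abb", "bbb", "abbb", "ba"}
L2 = {"bbb"}

print(sorted(factor_subtract(L1, L2)))  # ['ab', 'abb', 'ba']
```

The words bbb and abbb are removed because each contains bbb as a factor, which corresponds to membership in Alph* L2 Alph*.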
4.2 Informal description of the 
algorithm 
We will first apply the algorithm to a small 
example. Suppose that A1 is the automaton 
represented in figure 4, that A2 is the 
automaton represented in figure 5 and that we 
want to compute A1 f- A2. 
Figure 4 
Figure 5 
Each state of the automaton A = A1 f- A2 will be 
labelled with a state of A1 and a set of states of 
A2 (i.e. a member of the power set of Q2). 
More concretely, the automaton A = A1 f- A2 of 
figure 6 is built in the following way: 
The initial state is labelled (0,{0}); the first 0 
refers to the state 0 of A1 (0_1 for short). The 
letter a leads from 0_1 to the state 1_1 of A1 but 
to nothing in A2; we construct the state 
4 The automata are assumed to be deterministic, which is 
not an additional constraint since one can determinize 
them (see Aho, Hopcroft, Ullman 1974 for instance). 
(1,{0}), which means that, for a, 0 leads to 1 
in A1 but that {0} leads to nothing (the empty 
set) in A2, to which we systematically add the 
initial state. On the other hand, d2({0},b) = {1}, 
to which we add, as for a, the state 0; thus, in 
A, d((0,{0}),b) = (d1(0,b), {0, d2(0,b)}) = 
(1,{0,1}). For each state being constructed, 
we list the states it could refer to in A2 and, for 
each of these states, its image by the letter 
being considered. A specific configuration 
arises when one of the labels of the state of A 
being considered leads to the final state of A2: 
this means that a complete sequence of A2 has been 
recognized and should then be deleted. This is 
the case if we look at state (2,{0,1,2}) in A: 
d2({0,1,2},b) = {1,2,3} where 3 is final; thus 
this state has no transition for b, which deletes 
the path bbb forbidden by A2. 
Figure 6 
The following algorithm computes A1 f- A2: 
 1. f[0] = (i1, {i2}); 
 2. q = 0; 
 3. n = 1; 
 4. F = ∅; 
 5. do 
 6.   (x1, X2) = f[q]; 
 7.   G = {i2}; 
 8.   for each s ∈ Alph so that d1(x1, s) ≠ ∅ 
 9.     y1 = d1(x1, s); 
10.     for each x' ∈ X2 
11.       if d2(x', s) = f2 
12.         G = ∅; goto 8; 
13.       else 
14.         G = G ∪ {d2(x', s)}; 
15.     endfor 
16.     if ∃ q' <= (n-1) so that f[q'] = (y1, G) 
17.       d(q, s) = q' 
18.     else 
19.       f[n] = (y1, G); d(q, s) = n; n += 1; 
20.       if y1 ∈ F1 then F = F ∪ {n}; 
21.   endfor 
22.   q += 1; 
23. while (q < n) 
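The listing can be read as the following Python sketch. It is our reading of the algorithm, not the original C program: we assume that G is reset to {i2} for every letter, that "goto 8" means "skip this letter", and that the final-state test applies to the state that has just been created. The dictionary encoding of the transition functions, the helper `words` and the two example automata are hypothetical (they are not the automata of figures 4 and 5).

```python
def factor_subtraction(d1, i1, F1, d2, i2, f2):
    """Build A = A1 f- A2.

    d1, d2: dicts mapping (state, letter) -> state (partial, deterministic).
    i1, i2: initial states; F1: final states of A1; f2: final state of A2.
    Returns (d, F, labels): transitions, final states and state labels of A.
    """
    labels = [(i1, frozenset({i2}))]       # f[0] = (i1, {i2})
    index = {labels[0]: 0}
    d, F = {}, set()
    if i1 in F1:                           # assumption: the empty word is kept
        F.add(0)
    q = 0
    while q < len(labels):
        x1, X2 = labels[q]
        for (state, s), y1 in d1.items():
            if state != x1:
                continue
            G = {i2}                       # reset for every letter (assumption)
            forbidden = False
            for x in X2:
                y2 = d2.get((x, s))
                if y2 == f2:               # a forbidden factor is completed
                    forbidden = True
                    break
                if y2 is not None:
                    G.add(y2)
            if forbidden:
                continue                   # no transition for this letter
            key = (y1, frozenset(G))
            if key not in index:
                index[key] = len(labels)
                labels.append(key)
                if y1 in F1:
                    F.add(index[key])
            d[(q, s)] = index[key]
        q += 1
    return d, F, labels

def words(d, F, state=0, prefix=""):
    """Enumerate the words accepted from `state` (A is acyclic because A1 is)."""
    if state in F:
        yield prefix
    for (p, s), r in d.items():
        if p == state:
            yield from words(d, F, r, prefix + s)

# Hypothetical A1: all words of length 3 over {a, b}; A2 accepts only bbb.
d1 = {(p, s): p + 1 for p in range(3) for s in "ab"}
d2 = {(0, "b"): 1, (1, "b"): 2, (2, "b"): 3}
d, F, labels = factor_subtraction(d1, 0, {3}, d2, 0, 3)
result = sorted(words(d, F))
print(result)
```

With these inputs the state labelled (2,{0,1,2}) is created and receives no transition for b, so bbb is the only word of the hypothetical A1 that is not accepted by the result.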
5. Experimental results 
Given a syntactic pattern (first, Noun followed 
by an Adjective) and a text, we can detect its 
occurrences. We can search the text without 
applying constraints; this provides output 1 
(figures 7 and 8). We can also search it after 
having applied constraints (output 2). We shall 
compare both outputs. For instance, for the 
sentence: 
L'individu n'y est pas perçu comme une valeur 
abstraite et universelle, mais comme un être concret, 
comme le membre d'un ensemble particulier, localisé 
et qui n'existe que dans son rapport à cet ensemble. 
the program provides the following matchings: 
(1)                     (2) 
y est                   valeur abstraite 
valeur abstraite        être concret 
être concret            ensemble particulier 
d'un 
ensemble particulier 
Figure 7 
[Diagram: DELAF (simple words dictionary, 570,000 entries) and DELACF (compound word dictionary, 150,000 entries) feed the tagging of the text, followed by its FSA representation and the search after constraints] 
Figure 8 
The program runs in three steps (figure 8). It 
first takes a text and tags it according to the 
two dictionaries. The text is then transformed 
into its FSA representation on which the 
constraints are applied. 
Given a pattern (Noun followed by an 
Adjective), we compare its number of 
occurrences in outputs 1 and 2. This 
gives us a measure of the power of the filtering. 
It is worthwhile to point out that the 
experiments were carried out on untagged 
corpora, which means that the duration of the 
tagging process is included in the figures given 
in the tables. These experiments were done on 
personal computers 5 and it can be seen that the 
5 Experiments were done on an IBM PS/2 386 at 25 MHz 
with OS/2 V1.3 and 8 MB of RAM. The program is written in C. 
time spent is low enough to permit on-line use 
(for compound word enrichment for instance). 
5.1 Searching Noun-Adjective patterns 
First, let us consider the pattern Noun 
Adjective (which is approximately equivalent 
to the English sequence adjective-noun). We 
first searched for each contiguous pair of 
words whose first element was labelled as a 
noun and whose second element was labelled as an 
adjective. This provides the result of the first 
line of figure 9. The first filtering uses the fact 
that, in French, the noun and the adjective 
have to agree in gender and in 
number. This gives, for the same texts, the 
results of the second line. Third, we applied the 
algorithm described above as a second filter; 
this leads to the results of the third line. 
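The first two searches can be sketched as follows in Python. The token format (a word with a list of (part of speech, gender, number) readings), the tag letters N and A, and the four-token sequence are assumptions made for the illustration; the sequence is artificial and only chosen so that the agreement filter removes one candidate.

```python
def noun_adj_pairs(tokens, agreement=False):
    """Return contiguous (noun, adjective) pairs.

    tokens: list of (word, readings); each reading is (pos, gender, number).
    With agreement=True, a pair is kept only if some noun reading and some
    adjective reading share gender and number.
    """
    hits = []
    for (w1, r1), (w2, r2) in zip(tokens, tokens[1:]):
        found = any(
            pos1 == "N" and pos2 == "A"
            and (not agreement or (g1 == g2 and n1 == n2))
            for pos1, g1, n1 in r1
            for pos2, g2, n2 in r2
        )
        if found:
            hits.append((w1, w2))
    return hits

# Artificial token sequence with hypothetical dictionary readings.
tokens = [
    ("valeur", [("N", "f", "s")]),
    ("abstraite", [("A", "f", "s")]),
    ("membre", [("N", "m", "s")]),
    ("abstraite", [("A", "f", "s")]),
]

print(noun_adj_pairs(tokens))                  # search without agreement
print(noun_adj_pairs(tokens, agreement=True))  # search with agreement
```

The pair membre abstraite is found by the plain search but rejected once gender agreement is required.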
                     Editorial   Article    Novel      Corpus 
Size (bytes)         4185        13010      369292     1738115 
                     (1 page)    (4 p.)     (100 p.) 
NAdj                 66          227        3334       31532 
  time               15"         40"        19'        1h25 
NAdj agreement       56          198        2651       28234 
  time               15"         40"        19'        1h25 
NAdj agreem.         13          40         1277       11125 
+ constr. 
  time               20"         50"        41'        3h05 
NAdj real number 6   10          134        1150       not counted 
Figure 9 7 
The texts are in the form of ASCII files. The 
first one is a magazine editorial of about 1 
page 8, the second one is an article of about 4 
pages 8. The third one is a novel by the French 
19th-century writer Jules Verne: Les aventures 
du docteur Ox. The fourth one is a compilation 
of texts with a large proportion of law texts. We 
give, in the last line, the number of patterns 
that should have been detected if the filtering 
had been perfect; this count was done by hand. 
6. Conclusion 
6 This number is of course obtained by hand, which 
explains why we did not compute it for the fourth text. 
7 The simple word morphological dictionary of 570,000 
factorized entries was compressed into 1 MB and the 
150,000 compound forms of DELACF were compressed into 
2 MB. 
8 From Société Magazine 1989 
These experiments will be extended to a 
larger number of patterns and to various types 
of corpora, but we already think that those 
presented here show that the method can 
actually be used as a practical tool for easing 
the construction of terminological lists. 
8. Appendix 
Some constraints 
We present here a sample of constraints as they are 
actually implemented. The following set of sequences 
represents paths that have to be deleted from the FSA 
representing the text. This set is to be compiled into an 
FSA before being used by the program. The ?? word 
means that it can be any transition of the state it is 
compared to. For instance, the comparison of ?? and the 
character a gives a as result. It leaves empty the 
parameters that are specific to the words being matched 
(i.e. the word itself or its canonical form). 
/il/unitllex/n/ 
/il/unit/lerdvP?l??/??/v-t/??/1/??/ 
/il/unitllex/vl??/'!?fl?/v-t/??/2/??/ 
hllunithex/v/??l??l??/v-t,t??/3/P/ 
hl/unitllex/pre/ne/unit/lex/v/??f??f??/v-tl??/lP. ?/ 
/illunitllex/pre/nelunitllex/vP. ?P. ?P. ?/v-tl??12P?/ 
/illunitllex/pre/ne/unil/lex/v/??P?l?. ?/wtl?. ?/3/P/ 
/il/unit/lerdprolle/unitllerdvl??f!?f~. ?/v-t/??ll/??/ 
/il/unilllex/pro/le/unitllex/v/??f??/??/v-tl??/2P. ?/ 
/il/unitllexlpro/le/unitllex/v/??f??f??/v-tf??131P/ 
/il/unit/lex/v/?. ?f~. ?/??/v-tP?/lf??/ 
/il/unitllex/v/'!?f??/??/v-tl?. ?/2\[??/ 
/il/unit/lex/vr??/??/??/v-tf??/3/P/ 
/il/unit/lex/det/ 
/il/unit/lex/adj/ 
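The role of the ?? word can be illustrated with a short Python sketch. It treats a constraint and a text path as '/'-delimited lists of fields and compares them field by field; this field-level reading, the function names and the placeholder values x1 to x5 in the text path are assumptions made for the example, since the meaning of each field is not detailed here.

```python
WILDCARD = "??"

def parse(path):
    """Split a '/'-delimited sequence into its fields."""
    return path.strip("/").split("/")

def field_match(pattern_field, field):
    """Compare one constraint field with one text field.

    '??' accepts anything and the result of the comparison is the text field;
    None signals a mismatch.
    """
    return field if pattern_field in (WILDCARD, field) else None

def matches(pattern, label):
    """True if every field of `label` is accepted by the matching field of `pattern`."""
    p, l = parse(pattern), parse(label)
    return len(p) == len(l) and all(
        field_match(a, b) is not None for a, b in zip(p, l))

constraint = "/il/unit/lex/v/??/??/??/v-t/??/1/??/"
# Hypothetical text paths; x1..x5 stand for word-specific values.
first_person = "/il/unit/lex/v/x1/x2/x3/v-t/x4/1/x5/"
third_person = "/il/unit/lex/v/x1/x2/x3/v-t/x4/3/x5/"

print(field_match("??", "a"))            # the comparison of ?? and a gives a
print(matches(constraint, first_person))
print(matches(constraint, third_person))
```

Under this reading, the first path would be deleted from the text automaton because it matches the constraint, while the second differs in the person field and is kept.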

Bibliography 
Aho, Alfred V., John E. Hopcroft, Jeffrey D. 
Ullman, 1974. The Design and Analysis of 
Computer Algorithms. Addison Wesley, 
467p. 
Calzolari, Nicoletta, Remo Bindi, 1990. 
Acquisition of Lexical Information from a 
Large Textual Italian Corpus. Coling 90, 
Proceedings of the Conference. Helsinki. 
Church, Kenneth, William Gale, Patrick 
Hanks, Donald Hindle, 1989. Parsing, Word 
Associations and Typical Predicate-Argument 
Relations. Internal report. Bell Laboratories, 
Murray Hill. 
Courtois, Blandine, 1984, 1989. DELAS: 
Dictionnaire Electronique du LADL pour les 
mots simples du français. Paris: Rapport 
technique du LADL, Université Paris 7. 
Gross, Maurice, 1968. Grammaire 
transformationnelle du français, 1-Syntaxe du 
verbe. Cantilène, Paris, 183p. 
Koskenniemi, Kimmo, 1990. Finite-State 
Parsing and Disambiguation. Coling-90. 
Proceedings of the conference. Helsinki. 
Pereira, Fernando C.N., Rebecca N. Wright, 
1991. Finite state approximation of phrase 
structure grammars. 29th Meeting of the 
A.C.L., Proceedings of the conference. 
University of California, Berkeley. 
Revuz, Dominique, 1991. Dictionnaires et 
lexiques, méthodes et algorithmes. PhD 
dissertation. Université Paris 7, Paris, 130p. 
Rimon, Mori, Jacky Herz, 1991. The 
recognition capacity of local syntactic 
constraints. Fifth Conference of European 
Chapter of Association for Computational 
Linguistics. Proceedings of the Conference, 
Berlin. 
Silberztein, Max, 1989. Dictionnaires 
électroniques et reconnaissance lexicale 
automatique. PhD dissertation, Université 
Paris VII, Paris, 175p. 
